ABSTRACT


Ching-Chih Hsiao, Ph.D., Purdue University, December 1982. Highly Parallel
Processing of Relational Databases. Major Professor: Lawrence Snyder.

New  computer  architectures  are  feasible  because  of  the  advances  in 
VLSI  design  and  fabrication  technologies.  Among  them,  highly  parallel 
structures  coordinate  hundreds  of  thousands  of  processing  elements  that 
function  cooperatively.  These  structures  are  especially  useful  in  solving 
computationally  intensive  problems.  This  thesis  applies  the  highly  parallel 
approach  to  improve  the  efficiency  in  processing  relational  database 
queries.  High-performance  algorithms  for  basic  relational  operations  are 
explored.  Efficient  composition  of  these  algorithms  to  process  whole  queries 
is also investigated.

Regularity  and  uniformity  are  necessary  in  order  to  make  the  highly 
parallel  computing  cost-effective.  An  efficient  primitive,  called  POP-SORT,  is 
proposed  to  unify  the  relational  operations  such  as  sorting,  duplicate- 
removal,  union,  intersection,  and  difference.  The  three  latter  operations  are 
even  allowed  to  have  multisets  as  operands.  POP-SORT  is  based  on  an  easy 
scheme  which  adapts  any  highly  parallel  and  regular  sorting  algorithm  to 
perform all these database operations. The primitive compares favorably,
in terms of time complexity, with existing algorithms for the five operations.
The optimality of POP-SORT is also proved for a restricted but reasonable
type of parallel computation. Furthermore, sublinear time performance is
possible for join operations if argument relations are preconditioned by
POP-SORT.

This work is part of the Blue CHiP Project. It is supported in part by the Office of Naval
Research Contracts N00014-80-K-0816 and N00014-81-K-0380. The latter is Task SRO-100.

For  processing  a  whole  query,  the  operation  tree  parsed  from  the  query 
can  be  executed  by  composing  individual  algorithms  for  the  operations.  The 
Configurable,  Highly  Parallel  (CHiP)  computers  have  the  flexibility  to  provide 
programmable  processor  interconnections  for  composing  algorithms.  Query 
embedding is a method of executing whole operation trees to exploit
maximum parallelism on the CHiP computers. It involves processor
allocation and the embedding of appropriate interconnections. With the bitonic
POP-SORT, which is a generalization of Batcher's bitonic merge sort, the
query embedding can be simplified significantly.


CHAPTER  1 


INTRODUCTION 


Computer architects have been attempting to avoid the von Neumann
structure, in which a single CPU serially fetches, processes, and stores data
items. Owing to advances in VLSI fabrication and design technologies,
computer architectures are no longer strictly confined by the cost of
computing hardware. In the near future it will be feasible to implement highly
parallel computers consisting of hundreds of thousands of processing
elements [Hayn82]. With so many processing elements operating
cooperatively, a speed-up of many orders of magnitude is possible.

Highly parallel structures are known to be useful for solving some
computationally intensive problems in areas such as meteorology,
cryptography, and image processing. However, integrating many processing
elements to implement a reliable and cost-effective system is extremely
difficult. Problems suitable for highly parallel computing must show a high
degree of regularity and uniformity.

The relational data model [Codd70] not only provides a simple view of
databases but also calls for a particular feature named relational processing
capability [Codd82]. This feature entails the definition of relational
operations which treat whole relations as operands. It is of interest to study
the application of highly parallel architectures and algorithms to the
implementation of relational operations.


Historically, efficiency of database processing has been stressed, but
convenience and expressiveness have been of less concern. Application
programmers' productivity is thus far behind the demands from end users of
database systems. A relational data model, by raising the user interface
from physical details to a higher logical level, provides improved
convenience and expressiveness. E. F. Codd [Codd82] also remarked that the
relational processing capability is a key factor leading the relational model
toward a practical foundation for improved productivity. It is therefore very
important to implement a relational processing capability that achieves
high performance.

1.1  Goal  and  Methodology 

The  goal  of  this  work  is  to  take  advantage  of  the  VLSI  computation 
power  and  the  highly  parallel  architectures  to  improve  relational  database 
processing.  We  are  concerned  both  with  high-performance  implementations 
of  individual  relational  operations  and  efficient  processing  of  whole  queries. 

Highly parallel computing relies crucially on efficient communication to
exploit parallelism successfully. When problems are solved by parallel
computation, communication often takes more time than the actual
computation [Lint81]. Processor interconnections, hardwired or
software-controlled, on highly parallel computers are usually selected to
support efficient communication. Therefore, it is important to identify
communication schemes which are efficient for solving many problems.

Sorting is a necessary operation in many applications. Highly parallel
sorting has been vigorously studied and several efficient algorithms exist
[Batc68, Ston71, Thom77, Nass79; Mull75, Hirs78, Prep78]. For highly
parallel processing of relational databases, we unify several operations on a
single communication scheme by reducing those operations to sorting. The
primitive operation POP-SORT (Primitive Operation SORT) is thus proposed for
the database operations such as sorting, union, intersection, difference, and
duplicate-removal. We also apply POP-SORT to solve join operations in
sub-linear time.

POP-SORT presents the possibility of adapting any sorting algorithm to
become a primitive for the five database operations. For merge-oriented
sorting methods, the adaptation can be easily done by replacing the simple
comparison function with a slightly modified one. The simple comparison
function is extended to have a marking capability that marks one of the two
argument items when they are found to be equal. Comparison functions
acting only in the local computation at processing elements do not affect the
communication among the processing elements at all. For sorting methods
in general, the adaptation can be done by two marking processes that both
take constant time. The marking processes require communication only as
simple as a linear array.
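
The merge-oriented adaptation can be illustrated with a small sequential sketch in Python. It is not the parallel construction developed in this thesis: it uses an ordinary two-way merge sort, and the names `merge_mark` and `pop_sort_sketch`, as well as the choice to mark the item from the right-hand run, are assumptions made only for illustration. The only change from a plain merge sort is the extended comparison, which sets a mark bit when the two compared keys are equal.

```python
def merge_mark(left, right):
    """Merge two sorted lists of [key, mark] pairs, marking duplicates."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        a, b = left[i], right[j]
        if a[0] == b[0]:
            b[1] = 1              # extended comparison: mark one of the two equal items
        if a[0] <= b[0]:          # ties take the left item, so the merge is stable
            out.append(a)
            i += 1
        else:
            out.append(b)
            j += 1
    return out + left[i:] + right[j:]


def pop_sort_sketch(keys):
    """Return (sorted_keys, marks); a mark of 1 flags a duplicate copy."""
    runs = [[[k, 0]] for k in keys]
    while len(runs) > 1:
        paired = [merge_mark(runs[p], runs[p + 1])
                  for p in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:         # an odd run is carried to the next round
            paired.append(runs[-1])
        runs = paired
    merged = runs[0] if runs else []
    return [k for k, _ in merged], [m for _, m in merged]


keys, marks = pop_sort_sketch([3, 1, 3, 2, 1, 3])
# keys == [1, 1, 2, 3, 3, 3], marks == [0, 1, 0, 0, 1, 1]
```

Because the merge is stable and always marks the right-hand item, the first copy of each value in every run stays unmarked, so exactly one copy of each value is unmarked at the end.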

The efficiency of POP-SORT in performing the five database operations is
demonstrated by an instance called the bitonic POP-SORT. It is a
generalization of Batcher's bitonic merge sort [Batc68]. The performance of the
bitonic POP-SORT compares favorably with existing algorithms (upper
bounds) for the five database operations. To further evaluate the optimality
of POP-SORT, we look into the reducibility relationships between it and the
database operations.
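
For readers unfamiliar with the underlying sort, the following Python sketch shows the standard recursive form of Batcher's bitonic merge sort for inputs whose length is a power of two. It is a sequential simulation for illustration only, without the marking extension that turns it into the bitonic POP-SORT; the function names are not taken from the thesis.

```python
def bitonic_merge(seq, ascending):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    lo, hi = list(seq[:half]), list(seq[half:])
    for i in range(half):
        # One compare-exchange step. The `half` pairs are independent,
        # so a parallel machine could process them all at once.
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(seq, ascending=True):
    """Sort by building a bitonic sequence from two oppositely sorted halves."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    first = bitonic_sort(seq[:half], True)
    second = bitonic_sort(seq[half:], False)
    return bitonic_merge(first + second, ascending)


print(bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every compare-exchange step follows a fixed pattern that does not depend on the data, which is the kind of regularity highly parallel structures rely on.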

The CHiP (Configurable Highly Parallel) computers are capable of
providing dynamic and programmable interconnections [Snyd82]. It is thus
possible to embed required connections for processing whole queries. To
expand the spectrum of parallelism to process whole queries, we explore the
feasibility of the query embedding on the CHiP computers. In [Snyd82]
Snyder showed that the CHiP computers have the flexibility to compose
algorithms to solve large and computationally intensive problems. By
employing the bitonic POP-SORT as a primitive for several database operations,
the composition of algorithms to process whole queries can be simplified
significantly.
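
The idea of processing a whole query by composing individual operations can be pictured with a minimal Python sketch. It evaluates an operation tree bottom-up; the tuple encoding of the tree, the `OPERATIONS` table, and the use of Python sets as relations are all hypothetical choices for illustration and say nothing about how interconnections are embedded on a CHiP computer.

```python
# Hypothetical operator table; Python sets of tuples stand in for relations.
OPERATIONS = {
    "union": lambda a, b: a | b,
    "inter": lambda a, b: a & b,
    "differ": lambda a, b: a - b,
}


def evaluate(node, relations):
    """Evaluate an operation tree bottom-up.

    A leaf is a relation name; an inner node is (op, child, child).
    """
    if isinstance(node, str):
        return relations[node]
    op, *children = node
    # The subtrees are independent of each other, so a parallel
    # machine could evaluate them at the same time.
    args = [evaluate(child, relations) for child in children]
    return OPERATIONS[op](*args)


relations = {"R": {(1,), (2,)}, "S": {(2,), (3,)}, "T": {(3,), (4,)}}
tree = ("differ", ("union", "R", "S"), ("inter", "S", "T"))
result = evaluate(tree, relations)  # {(1,), (2,)}
```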

1.2  Definitions  and  Notation 

A relation is normally a set of unique tuples and each tuple consists of
an ordered sequence of components. As duplicates are artifacts of certain
relational operations, we allow relations to be multisets consisting of
duplicate tuples. Basic relational operations like sorting, restriction (selection),
join, Cartesian product, and quotient are defined as those in textbooks (see,
for example, [Ullm80]). Projection, duplicate-removal, union, intersection,
and difference are defined slightly differently in this work.

For remove-duplicates we do not insist on discarding the duplicate
items. Given n data items $x_0, x_1, \ldots, x_{n-1}$, the goal of duplicate-removal is to
compute the mark bits $\mu(0), \mu(1), \ldots, \mu(n-1)$ for these items. In the
sequence $x_0^{\mu(0)}, x_1^{\mu(1)}, \ldots, x_{n-1}^{\mu(n-1)}$, we distinguish $x_i^{\mu(i)}$ as a duplicate item
if $\mu(i)=1$. An additional operation, segregation, can be used to pack and
separate marked and unmarked data items in the sequence [Schw80]. To
perform projection on a relation, we assume that duplicate-removal is not
automatically invoked. The operations union, intersection, and difference
may be relaxed to allow multisets as operands. Unless otherwise noted, they
are just set operations as usual.
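
To make the mark-bit formulation concrete, here is a small sequential sketch in Python. The function names `rmdup_marks` and `segregate` are illustrative, and keeping the first occurrence of each value unmarked is an assumption; the definition above only requires that duplicates be distinguishable by their mark bits.

```python
def rmdup_marks(items):
    """Return mark bits mu, where mu[i] == 1 flags items[i] as a duplicate."""
    seen = set()
    marks = []
    for x in items:
        marks.append(1 if x in seen else 0)
        seen.add(x)
    return marks


def segregate(items, marks):
    """Pack unmarked items first, then marked ones, preserving relative order."""
    kept = [x for x, m in zip(items, marks) if m == 0]
    dups = [x for x, m in zip(items, marks) if m == 1]
    return kept, dups


items = ["a", "b", "a", "c", "b"]
marks = rmdup_marks(items)            # [0, 0, 1, 0, 1]
kept, dups = segregate(items, marks)  # (['a', 'b', 'c'], ['a', 'b'])
```

Note that nothing is discarded: the marked items are still present after segregation and are merely separated from the unmarked ones.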

A highly parallel processor is a processing device which integrates
many processing elements. By "processor" we may refer to a single
processing element or a system of coordinated processing elements. Usually it
means a processing element unless further indicated by the context. For
example, the CHiP "processor" is a "highly parallel processor" in the
collective sense.

The following notation is used throughout this thesis.

$\log^k y$            the base two logarithm, $(\log_2 y)^k$.
$\lceil x \rceil$     the least integer greater than or equal to x.
$\lfloor x \rfloor$   the greatest integer less than or equal to x.
$t_R$                 time required for one data routing step.
$t_C$                 time required for one comparison step.
PE                    processing element which may have some local memory.
$A \cup B$            the union of two sets A and B.
$A \cap B$            the intersection of two sets A and B.
$A - B$               the difference of two sets A and B.
union(A,B)            the union of two multisets A and B.
inter(A,B)            the intersection of two multisets A and B.
differ(A,B)           the difference of two multisets A and B.
rmdup(A)              the duplicate-removal on multiset A.

1.3  Organization  of  the  Thesis 

In Chapter 2, we look at the conventional approaches of database
machine designs. The conventional approaches do not solve the
compute-bound operations satisfactorily. Several highly parallel structures for
solving the compute-bound operations are thus proposed by researchers. We
also discuss those structures and the algorithms proposed to be executed
on them.

Chapter  3  presents  a  methodology  to  apply  parallel  sorting  to  solve 
other  problems.  By  reducing  union,  intersection,  difference,  and  duplicate- 
removal  to  sorting,  these  operations  are  unified  by  the  primitive  operation 
POP-SORT.  Two  adaptations  are  shown  to  extend  merge-oriented  and  other 
sorting  methods  to  become  POP-SORT.  The  adaptation  overhead  is  shown  to 
be  negligible.  We  also  show  that  POP-SORT  can  be  used  to  perform  join 
operations in sub-linear time. This application of POP-SORT is especially
suitable for easy join operations that produce only small result relations.
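
The reduction of the set operations to sorting can be sketched sequentially in Python as follows. This is a simplified illustration under stated assumptions, not the POP-SORT algorithm of Chapter 3: the operands are duplicate-free lists, each item is tagged with its source relation (0 for A, 1 for B), and after one sort every decision is made by looking only at adjacent items. The function name `set_ops_by_sorting` is hypothetical.

```python
def set_ops_by_sorting(a, b):
    """Return (union, intersection, difference a-b) of two duplicate-free lists."""
    # Tag each item with its source; equal keys become adjacent after sorting,
    # with the copy from `a` placed before the copy from `b`.
    tagged = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    union, inter, differ = [], [], []
    for i, (x, src) in enumerate(tagged):
        prev_same = i > 0 and tagged[i - 1][0] == x
        next_same = i + 1 < len(tagged) and tagged[i + 1][0] == x
        if prev_same:
            inter.append(x)       # second copy: the value occurs in both operands
        else:
            union.append(x)       # first copy of each distinct value
        if src == 0 and not next_same:
            differ.append(x)      # item of `a` with no partner from `b`
    return union, inter, differ


u, n, d = set_ops_by_sorting([4, 1, 3], [3, 5, 1])
# u == [1, 3, 4, 5], n == [1, 3], d == [4]
```

In a parallel setting each neighbor test would be a local, constant-time step after the sort, so the cost is dominated by the sorting itself.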

The efficiency of POP-SORT is investigated in Chapter 4. A complexity
hierarchy showing the reducibility relationships among POP-SORT and the
five database operations is first established. The complexity hierarchy
indicates that the optimality of POP-SORT relies on the reducibility of sorting to
duplicate-removal. We therefore look into the reducibility of sorting to
duplicate-removal by considering two types of comparison functions, the
weak comparison ($=$, $\neq$) and the strong comparison ($<$, $=$, $>$).

Chapter 5 deals with some interesting aspects of performing the bitonic
sort with the mesh interconnection on the CHiP computers. We design an
efficient algorithm that rearranges n sorted data items among three major
indexing schemes in less than $(3\sqrt{n})\,t_R$ time. Sorting with shadow regions is
a technique that allows the allocation of exactly n processing elements for
sorting n data items (n is an arbitrary integer). We also demonstrate how
data communication can be improved by properly programming the
switching elements on the CHiP computers. Two different methods of sorting k*n
data items on a CHiP region of n processing elements are also analyzed.

Processing whole queries on the CHiP computers is the subject of
Chapter 6. Relational algebraic queries are considered. The idea of
embedding whole operation trees parsed from database queries is explored. With
the bitonic POP-SORT, we demonstrate that query embedding is simplified
significantly. We also discuss several optimization strategies to improve
query embedding on the CHiP computers.


CHAPTER  2 


HIGHLY  PARALLEL  DATABASE  MACHINES 


Database  machines  are  specialized  computers  dedicated  to  executing 
database  management  functions.  They  are  usually  connected  to  general- 
purpose  computers  as  back-end  machines.  If  a  database  machine  is 
enhanced  with  a  highly  parallel  processor  to  solve  compute-bound  database 
operations,  we  call  it  a  highly  parallel  database  machine.  In  Figure  2-1  we 
show  the  configuration  of  a  back-end  system  consisting  of  a  highly  parallel 
database  machine. 

In the back-end system, the host computer acts as the interface
between users and the database machine. It is responsible for taking users'
requests, translating the high-level data manipulation programs into
database machine commands, instructing the database machine to perform the
commands, and returning the response to the users. Besides the highly
parallel processor, there are two major components in the database
machine: the back-end controller and the mass storage. The back-end
controller serves as the interface to the host computer. The mass storage is
content addressable in order to perform searching and update operations as
well as other I/O-bound database operations efficiently. Between the mass
storage and the highly parallel processor there is a wide data channel to
support rapid data loading and unloading. This bandwidth is also needed in
associative processor systems [Berr79] and array processor systems
[Batc80].


[Figure 2-1: block diagram with the labels Users, Database Machine, Back-end
Controller, Content addressable Mass Storage, and Highly Parallel Processor.]

Figure 2-1. The system configuration of highly parallel
database machines.


This chapter presents a brief overview of the principal approaches in
conventional database machine designs. The inability of conventional
approaches to solve compute-bound database operations is discussed.
Highly parallel processors are then proposed as a means of extending the
computation power of database machines. Next, we review some highly
parallel structures and their algorithms that have been reported to be
useful for database applications. All this serves as a benchmark for evaluating
our research work.

2.1 Background

As database management techniques prove helpful, users want their
databases to be larger and more inclusive. But as databases become
progressively larger, conventional general-purpose computers fail to meet the
response time requirements of many applications. With the adoption of
high-level data models and data manipulation languages, high-performance
implementation of database management systems becomes even more
crucial. Two well-known implementations of relational database management
systems, System R [Astr76] and INGRES [Ston76], amply demonstrate the
complexity and difficulty of query processing under these circumstances.

Since software techniques on conventional, general-purpose computers
cannot implement database management systems efficiently enough,
researchers have turned to alternative computer architectures and
special-purpose hardware. Canaday [Cana74] proposed that database management
functions be placed on a dedicated back-end processor which has exclusive
access to the database. By limiting the back-end processor to the
performance of only database management functions, it can have the advantage of
efficiency through specialization. But the implementation of the
experimental Database Management System (XDMS) [Cana74] failed to show that the
use of a general-purpose computer as back-end is a good approach.
Specialized database machines are, therefore, designed to serve as the back-end
computers [Bane79, DeWi79, Schu79].

Many hardware organizations have been proposed to facilitate database
processing although they are not all complete designs of database machines.
Two objectives are involved. One is to improve the non-query aspects of
processing such as searching, retrieval, insertion, deletion, and
modification. The other is to speed up the query aspects of processing
which may involve some compute-bound operations.

There is a consensus that content addressable memory is desirable for
efficient searching and updating. But storing databases entirely in
associative memory is infeasibly expensive. Fortunately, the "logic-per-track"
approach proposed by Slotnick [Slot70] provides a practical solution for
implementing a large-volume memory with content addressability. Many
designs have applied some type of the logic-per-track approach to achieve
the associativity and parallelism for fast searching and updating [Lang78].
Among them are the Content-Address Segment Sequential Memory (CASSM)
[Su75, Su79], the Content Addressed File Store (CAFS) [Babb79], the Data
Base Computer (DBC) [Bane78, Bane79], the Relational Associative Processor
(RAP) [Ozka75, Schu79], and the Rotating Associative Relational Store
(RARES) [Lin76].

One useful strategy to reduce the overhead of data movement is to
process data in place whenever possible. By placing some processing capability at
the mass storage level, the logic-per-track approach performs not only
searching and updating effectively but other operations as well. I/O-bound
relational operations like restriction and projection (without removing
duplicates) can be performed at the memory level. Other operations, however,
are not easily supported [Song81, DeWi82]. Sorting, duplicate-removal,
union, intersection, difference, join, and Cartesian product all require that
one data item interact with many others. These operations require complex
processor interconnections that cannot be easily implemented using the
logic-per-track approach. This is because of the physically dispersed
character of the read/write heads. Implementing these operations at the
secondary storage level seems to require some kind of looping or iteration.


Several  techniques  help  to  improve  query  processing  on  compute-bound 
operations.  The  overhead  incurred  by  the  time-consuming  secondary 
memory  accesses  can  be  reduced  by  using  intelligent  file  systems  and 
memory  management.  Unnecessary  database  information  can  be  filtered 
out  before  it  is  submitted  to  the  processor.  The  use  of  special  processing 
devices  is  yet  another  weapon  with  which  researchers  attack  the  compute- 
bound  problems.  Much  special-purpose  hardware  has  been  proposed  for 
performing  the  operations  join  and  sorting.  In  addition,  in  the  DBC  design 
several  compute-bound  functions  or  "post-processing  functions"  [HsiD79] 
are  performed  by  a  multiprocessor  system.  These  post-processors  are 
linearly connected, and each has its own local memory. Also in [DeWi79] a
multiprocessor architecture called DIRECT was designed to support
relational query processing.

Special hardware dedicated to a few individual operations does not solve
the problem completely or uniformly. The multiprocessor systems proposed
demonstrate reasonably good, but restricted, performance improvement.
Application of highly parallel processors has thus been proposed for
database processing [Kung80, Song80, HsiC81, Lehm81].


2.2  Highly  Parallel  Processors 

A highly parallel processor may consist of hundreds of thousands of
processing elements which function cooperatively to solve compute-bound
problems. The computation power of the processing elements is limited to
that required by database management queries. The instruction set is thus
small and can be tuned to perform query processing more efficiently. When
the highly parallel processor is implemented by VLSI chips, less area for
computing logic implies that more area can be dedicated to the local
memory logic or the processor interconnection circuitry. More
importantly, a larger scale integration of processing elements is possible if
more chip area is available for processor interconnections.

In highly parallel structures, inter-processor communication is the key
to successful exploitation of the available computing power. The processor
interconnection problem has motivated much research recently. An
important question that needs to be addressed for general computation and data
processing alike is:

What are the most effective interconnection paths for communicating
PEs to process database queries?

This section discusses several structures of highly parallel processors and
their algorithms. The highly parallel processors addressed here are: the
systolic array system, the double tree machine, the Ultracomputer, and the
CHiP computer. The first three represent different processor
interconnections, and the last one has the flexibility to provide them (as well as the
mesh interconnection).

In Table 2-1 we first summarize the time complexities of certain
database operations on these machines. POP-SORT is the primitive operation
proposed in this thesis which can perform the other five operations (Chapter
3). The complexity is measured by assuming that the argument relations
have n tuples. Except for the systolic arrays and the tree machine, we
assume that the data is already in the processing device. The effect of
propagation delay is ignored here for the tree machine and the Ultracomputer.


Table 2-1. Algorithms of database operations on
highly parallel machines.

operations        $\cup$          $\cap$          $-$             rmdup         sort                  POP-SORT
Systolic arrays   $O(n)$          $O(n)$          $O(n)$          $O(n)$        -                     -
Tree machine      $\Theta(n)$     $\Theta(n)$     $\Theta(n)$     $\Theta(n)$   $\Theta(n)$           $\Theta(n)$
Mesh computer     -               -               -               -             $\Theta(\sqrt{n})$    $\Theta(\sqrt{n})$
Ultracomputer     $O(\log^2 n)$   $O(\log^2 n)$   $O(\log^2 n)$   -             $O(\log^2 n)$         $O(\log^2 n)$
CHiP                                                                                                  [illegible]**

*  An instance of POP-SORT, the bitonic POP-SORT, is used to
calculate the time complexities (Chapter 3.1).

** A technique is applied on the CHiP computers to achieve the
speed-up factor s over the mesh-connected computers (Section
5.3), where s < w*c.


Systolic  Arrays 

Systolic arrays have been proposed for many applications [Kung79,
Fost80, Kung82]. Kung and Lehman [Kung80] used systolic arrays to
implement relational database operations. Lehman [Lehm81] also applied systolic
arrays to processing simple queries.

They presented two types of systolic arrays to implement database
operations (Figure 2-2). A two-dimensional comparison array and a
one-dimensional accumulation array were used for union, intersection,
difference, and duplicate-removal. The comparison array alone is used for
join operations. Argument relations are "staged" into the comparison array
in a component-parallel and tuple-serial fashion. Tuples from different
relations flow in opposite directions in the comparison array so that they will
always pass by each other. The comparison results move from left to right.
They are recorded as a bit matrix for join or shifted to the accumulation
array to generate a bit string for the other operations.
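
The roles of the two arrays can be pictured with a purely functional Python sketch. It does not model the systolic data flow, the staging of tuples, or timing; it only shows what the comparison array computes (a bit matrix of pairwise tuple comparisons) and how an accumulation stage might reduce that matrix to a bit string, here for intersection. The function names are hypothetical.

```python
def comparison_matrix(rel_a, rel_b):
    """Bit matrix T with T[i][j] == 1 when tuple i of A equals tuple j of B."""
    return [[int(a == b) for b in rel_b] for a in rel_a]


def accumulate(matrix):
    """OR each row into one bit, as an accumulation stage might."""
    return [int(any(row)) for row in matrix]


def intersect(rel_a, rel_b):
    """Keep the tuples of A whose accumulated bit is set."""
    bits = accumulate(comparison_matrix(rel_a, rel_b))
    return [a for a, bit in zip(rel_a, bits) if bit]


A = [(1, "x"), (2, "y"), (3, "z")]
B = [(3, "z"), (1, "x")]
print(comparison_matrix(A, B))  # [[0, 1], [0, 0], [1, 0]]
print(intersect(A, B))          # [(1, 'x'), (3, 'z')]
```

Every tuple of one relation is compared with every tuple of the other, regardless of any ordering in the data.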


[Figure 2-2: diagram with the labels relation A, relation B, Comparison
Array, Accumulation Array, and result.]

Figure 2-2. The systolic array system for performing
database operations.


In the systolic arrays, the processing elements perform only simple
functions and the interconnections are very regular. Both of the arrays can
be implemented with only a few types of simple cells. Another advantage is
that computations are pipelined elegantly so that the processing time is
completely overlapped with the I/O time. However, from an algorithmic
point of view, the benefit of data ordering is totally ignored in [Kung80]. The
systolic arrays are fundamentally structures of linear time performance.

Systolic arrays are algorithmically specialized processors [Snyd82]. The
functions performed by systolic arrays are predetermined and rigidly
manufactured into VLSI products. Programmability is minimal. To
implement all the operations required by query processing, an integrated system
containing several systolic arrays is needed [Song81].


The BK-tree Machine

The BK-tree, or double tree, was proposed by Bentley and Kung
[Bent79] for pipelining searching operations such as retrieval, insertion,
deletion, and modification. On an n-processor version of this machine, a set
of n data items can be maintained such that all the searching problems are
processed in $2\log n$ steps. Tree-structured machines have also been
proposed as general-purpose processing devices by Browning [Brow80] and
many other researchers. In [Song80, Song81] this architecture was applied
to implement additional basic database functions.


[Figure 2-3: tree diagram with the input root node labeled.]

Figure 2-3. The BK-tree machine.


A BK-tree† machine is composed of three kinds of processing elements:
O-nodes, []-nodes, and V-nodes (see Figure 2-3). The []-nodes contain the data
items to be processed. The O-nodes are responsible for broadcasting
sequences of instructions and data to the []-nodes. Parallel computation is
carried out by the []-nodes. Partial results produced are then collected by
the V-nodes. Finally, the result emerges from the output root node.

† An interesting interpretation of "BK" is that "B" is mnemonic for broadcasting information
and "K" for collecting information.

Sorting can be easily implemented on a tree machine using the heap
sort algorithm [Mead80]. To perform union, intersection, and join on
relations A and B, Song employed two solutions [Song80]. One is to sort the two
argument relations using the tree machine and to perform further
processing elsewhere. The other is to load one relation in []-nodes and then
broadcast the other relation onto the []-nodes to perform the required operation.
Partial results produced in the []-nodes may have to be saved before they
can be accepted by the V-nodes (e.g. in performing join). The potential
bottlenecks were resolved by a request/acknowledge communication
convention [Song80].
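
The second solution, load one relation and broadcast the other, can be sketched in Python for the intersection case. This is a sequential simulation under simplifying assumptions: one tuple per leaf node, one broadcast per tuple of the second relation, and a pairwise OR standing in for the collecting tree. Pipelining and the request/acknowledge convention are not modeled, and the function names are hypothetical.

```python
def collect_or(bits):
    """Combine leaf results pairwise, level by level, like a collecting tree."""
    level = list(bits)
    while len(level) > 1:
        nxt = [level[k] or level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an unpaired node passes its value upward
            nxt.append(level[-1])
        level = nxt
    return bool(level and level[0])


def tree_intersection(rel_a, rel_b):
    """Load rel_a into the leaves, then broadcast rel_b one tuple at a time."""
    leaves = list(rel_a)              # one tuple per leaf node
    result = []
    for probe in rel_b:               # one broadcast per tuple of B
        hits = [leaf == probe for leaf in leaves]   # all leaves compare in parallel
        if collect_or(hits):
            result.append(probe)
    return result


print(tree_intersection([5, 1, 9, 3], [3, 4, 5]))  # [3, 5]
```

The loop over the second relation reflects the fact that its tuples enter through the root one at a time.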

The BK-tree machine is very efficient in pipelining successive searching
operations which take a single data item as the operand. However, it does
not perform as well on database operations which take whole relations as
operands. Again, the BK-tree machine is fundamentally a linear time
bounded structure. The performance barrier is inherited from the general
restriction of tree structures that only one data value at a time can flow into
and out of the tree through the root node. Furthermore, the VLSI layouts of
large trees are susceptible to the propagation delay problem [Pate81].

The Ultracomputer

Ultracomputers [Schw80] are machines with powerful and physically
realized interconnection patterns. They are composed of a large number of
processing elements, each connected with a fixed number of others. The
Ultracomputer in [Schw80] is based on the perfect shuffle interconnection
[Ston71]. Other powerful interconnections like the Cube-Connected-Cycles
(CCC) [Prep81] are also in this category, which we refer to as
ultracomputers.

On the Ultracomputer with the perfect shuffle interconnection, all the
permutations of data among processing elements can be realized in $\log n$
routing steps. Sorting, union, intersection, and difference can thus be
solved in logarithmic time. No results about duplicate-removal and join are
reported in [Schw80]. While the ultracomputer is efficient in solving certain
compute-bound operations, it is expensive to implement. Expandability is
poor because the interconnection complexity grows at least as a function
$n^2/\log^2 n$ of the number of processing elements n [Thom80]. Moreover,
propagation delay problems and synchronization difficulties can become more
severe when n is large.

The  CHiP  Computer 

A  Configurable,  Highly  Parallel  (CHiP)  [Snyd82]  processor  permits  the 
processor  interconnections  to  be  dynamically  programmed.  It  does  not 
limit  the  communication  to  one  fixed  structure  among  the  processing  ele¬ 
ments.  Nor  does  it  rely  on  a  single  interconnection  capable  of  simulating 
others  to  achieve  the  flexibility  of  communicating  processing  elements.  It 
provides  a  lattice  of  programmable  switching  elements  with  which  dynamic 
and  flexible  interconnections  can  be  specified. 

The processing elements are connected to the switch lattice at regular
intervals. The interval determines an important parameter w of the switch
lattice which is called the corridor width. Two more parameters of the
switch lattice which are important to this research work are the degree (or
the number of incident data paths) d and the cross-over capability c of the
switches. The cross-over capability denotes the maximum number of
independent data paths that can pass through a switching element. In
Figure 2-4 we show two structures of the switch lattice. The circles
represent switches and the squares represent processing elements.

Figure 2-4. Two structures of the switch lattice:
(a) w = 1, d = 4; (b) w = 2, d = 8.


At each switching element there is some local memory for storing a
fixed number of switch settings. The controller broadcasts a command to
the switches and the switches then make connections according to a
particular stored switch setting. The total effect of making connections at
the switches constitutes the designated interconnection. The processing
elements then communicate with each other assuming that the right
interconnections are realized by the switches. (See [Snyd82] for a more
detailed description of the CHiP computer; see [Snyd81] for a discussion of
programming processor interconnections.)

The CHiP computer can be easily configured to be a mesh-connected
computer. With the mesh interconnection, sorting can be done in O(√n)
time using adapted algorithms of Batcher's bitonic sort [Batc68, Kung77,
Nass79]. In Chapter 3 we shall present a primitive operation POP-SORT
which can perform the five database operations listed in Table 2-1. POP-SORT
does not require the special architecture of the CHiP computer. On the
contrary, it represents a methodology of applying parallel sorting to solve
other database operations. POP-SORT can be implemented on the tree machine,
the Ultracomputer, the mesh-connected computer, and the CHiP computer.
If sorting can be implemented with systolic arrays then the systolic arrays
can also be easily modified to implement POP-SORT.




CHAPTER  3 


AN  EFFICIENT  PRIMITIVE  OPERATION 


Highly parallel algorithms for database operations have been widely
studied. Several algorithms that perform sorting in sub-linear time exist
[Batc68, Ston71, Thom77, Nass79; Mull75, Hirs78, Prep78]. The set
operations union, intersection, and difference are best solved by performing
sorting first [Schw80]. For duplicate-removal and join, there are linear-time
bounded algorithms [Kung80, Song81]. The algorithms mentioned above differ
from one another and run on different machine architectures (see Table 2-1).

VLSI implementation of specialized devices has been vigorously
proposed [Kung79, Fost80, Kung80, Kung82]. However, cost-effectiveness of
VLSI implemented systems depends fundamentally on regularity and
uniformity. The initial development expenses of VLSI systems must be offset
by volume production. Thus, for VLSI implementation of highly parallel
versions of database operations, it is important to identify a nucleus of
processing steps common to the many database operations.

On general-purpose, highly parallel computers, programmability of
algorithms again depends on regularity and uniformity. It is extremely
expensive to develop software for highly parallel computers. Therefore, for
performing database operations on highly parallel computers, it is also
important to identify an efficient primitive operation.


Much work on highly parallel sorting has been reported and has
demonstrated some efficient solutions [Batc68, Ston71, Thom77, Nass79;
Mull75, Hirs78, Prep78]. To identify primitive processes for database
operations, we thus apply an algorithmic approach to reduce many database
operations to a sorting-based primitive. Whatever sorting algorithm and
machine architecture are chosen, we then always have a unified treatment of
those operations by implementing them with the primitive operation.

In this chapter we shall present POP-SORT (Primitive Operation SORT)
as a primitive operation for sorting, duplicate-removal, union, intersection,
and difference. The latter three operations are relaxed to have multisets as
operands. This relaxation is made possible by the versatility of POP-SORT
and has much practical merit in the context of query processing. For natural
join and equi-join, sub-linear time algorithms are possible if relations are
preconditioned by using POP-SORT.

In Section 3.1 we present a special family of POP-SORT which is based
on merge-oriented sorting methods. Employing a new comparison function,
any merge-oriented sorting method becomes POP-SORT. An efficient
implementation of the new comparison function and the overall performance of
the POP-SORT are shown in Section 3.2. In Section 3.3 we present a general
adaptation scheme that modifies any sorting algorithm to become POP-SORT.
We then show the application of POP-SORT to the natural join and
equi-join operations in Section 3.4.


3.1  POP-SORT,  a  Special  Example 

Among the fast and highly parallel sorting algorithms, we are most
interested in constructive, potentially logarithmic time, and
non-probabilistic algorithms. There are two categories of comparison-based
sorting algorithms that rely fundamentally on pairwise comparisons. One
category can be modeled as sorting networks [Knut73, p.220] that are
constructed from comparator modules [Batc68, Ston71, Thom77, Nass79]. The
other is based on the enumerating comparison method, in which each item is
compared with each of the others [Mull75, Prep78]. While the former is
conjectured to require Ω(log² n) levels of network depth, the latter is able
to reduce the time complexity to O(log n). However, a considerable drawback
of the enumeration sort is the requirement of O(n²) computing components or
the assumption of a shared, random access memory.

Batcher's bitonic merge sort [Batc68], described as a sorting network
in Appendix A-1, is one of the most famous. There are many adapted versions
of the bitonic sort. It requires O(√n) time using mesh interconnection
[Thom77, Nass79] or O(log² n) steps using shuffle interconnection [Ston71].
The number of computing components needed for these adapted algorithms
may be as small as O(n).

In this section we shall present a special example of POP-SORT called
the bitonic POP-SORT. This instance of POP-SORT uses a new comparison
function in Batcher's bitonic sorting method. The scheme that adapts the
bitonic sort to become POP-SORT relies on the merge-oriented nature of the
bitonic sort. Therefore, the adaptation scheme is immediately extended to
all the merge-oriented sorting methods.

Bitonic sort is based on a simple local operation together with a regular
and efficient way of pairwise data communication (Figure A-1). The local
operation is a simple comparison function which can be described as:


[Comparator module: inputs x, y; outputs min, max]

(x, y) → (min(x, y), max(x, y));
when x = y, min = max.


The communication scheme, from another point of view, is actually a
sequence of perfect shuffles on different numbers of data items. Perfect
shuffle is so powerful that it can simulate many important communication
functions in time proportional to the logarithm of the number of data items
[Schw80]. It should be able to solve other database operations if the simple
comparison is replaced by more sophisticated ones.

Definition. The compare-and-mark₁ operation performs comparison as well
as marking duplicates, and the marking process is idempotent:

(1) (x, y) → (min(x, y), max(x, y)) when x ≠ y;

(2) (x, x), (x⁻, x), or (x, x⁻) → (x⁻, x) and
(x⁻, x⁻) → (x⁻, x⁻).


The basic operation compare-and-mark₁ preserves the ordering among
distinct elements as usual. By marking a duplicate of x as x⁻ the basic
operation enforces an ordering rule such that x⁻ is a little smaller than x
but never smaller than any y for y < x. The ordering among the marked
duplicates x⁻ is arbitrary. The marking capability of the basic operation
extends the computation power of the bitonic sort to performing
duplicate-removal. Rather than proving this for the bitonic sort only, we
prove a more general application of compare-and-mark₁ to all the
merge-oriented sorting methods in the following theorem.

Theorem 3-1. Using the compare-and-mark₁ operation, any merge-oriented
sorting method can mark off all the duplicates.

[Proof] Consider any comparison-based method which merges two
ordered sub-lists. Every pair of neighboring elements in the result
list must have been compared directly, unless both elements are
from the same sub-list. If both sub-lists have duplicates marked off
before merge then the result list must have all the duplicates
marked by using the compare-and-mark₁ operation. Any merge-oriented
sorting method that starts with merging sub-lists of length one is
guaranteed to have no duplicates within a sub-list at the very
beginning. By induction, all the duplicates must have been marked off
in the final sorted list. ■
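
The induction in the proof can be illustrated with a small sequential
sketch. The Python below is not the parallel bitonic network; it is an
ordinary two-way merge sort, chosen only because it is merge-oriented. The
representation of an item as a (value, bit) pair, with bit 1 meaning
"unmarked", and the names merge, merge_sort, and pop_sort_dedup are
assumptions made for this illustration.

```python
def merge(left, right):
    """Merge two sorted sub-lists of (value, bit) items.

    bit 1 = unmarked (x), bit 0 = marked duplicate (x-).  Tuple ordering
    places x- just before x and after every smaller value.
    """
    left, right = list(left), list(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        a, b = left[i], right[j]
        # compare-and-mark_1: two unmarked copies of one value meet,
        # so one of them is marked.  Already marked items stay marked.
        if a[0] == b[0] and a[1] and b[1]:
            a = left[i] = (a[0], 0)
        if a <= b:
            out.append(a)
            i += 1
        else:
            out.append(b)
            j += 1
    return out + left[i:] + right[j:]


def merge_sort(items):
    """Plain top-down merge sort; sub-lists of length one have no duplicates."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


def pop_sort_dedup(values):
    """Sort values and mark all but one copy of each distinct value."""
    return merge_sort([(v, 1) for v in values])


result = pop_sort_dedup([3, 1, 3, 2, 1, 3])
unmarked = [v for v, bit in result if bit]
```

For the input above, the sorted result still holds all six items, and the
unmarked items are 1, 2, 3, one per distinct value.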

In addition to performing duplicate-removal, any merge-oriented
sorting method using compare-and-mark₁ is able to perform union. Performing
union is the same as performing duplicate-removal on the totality of the two
groups of data items. If our purpose is to unify sorting, duplicate-removal,
and union then compare-and-mark₁ is powerful enough. However, we are
aiming at identifying a primitive for more database operations. Intersection
and difference take two sets of data items as operands. One fundamental
requirement is that we must be able to distinguish data items from the two
groups in order to perform these two operations. We therefore extend the
compare-and-mark₁ operation to handle two groups of data items.

Definition. Let A and B be multisets, a ∈ A and b ∈ B. The
compare-and-mark₂ operation, in addition to performing the simple
comparison, enforces marking duplicates and three ordering rules:

(1) Idempotent marking-minus:

(a, a), (a⁻, a), or (a, a⁻) → (a⁻, a);

(a⁻, a⁻) → (a⁻, a⁻).

(2) Idempotent marking-plus:

(b, b), (b⁺, b), or (b, b⁺) → (b, b⁺);

(b⁺, b⁺) → (b⁺, b⁺).

(3) Quasi-stability:

(a, b) or (b, a) → (a, b) for a = b (marked or unmarked).

Similar to that shown in Theorem 3-1, the marking capability of
compare-and-mark₂ extends the computation power of the bitonic sort to
performing duplicate-removal and union. Moreover, two separate marking
rules allow us to mark duplicates for two multisets separately. The rule of
quasi-stability insists that A-elements precede B-elements if they all have
the same value. With the local operation having two separate marking
mechanisms and being quasi-stable, the execution of the bitonic sort will
end up with a sorted sequence like ...a⁻a⁻a⁻ a b b⁺b⁺..., where a = b. We
then can detect and manipulate all the ...ab... pairs in constant time. The
bitonic communication scheme together with the compare-and-mark₂
operation therefore can also implement the intersection and difference
operations. The three operations union, intersection, and difference are
even relaxed to have multisets as operands. We have thus proved the
following theorem.

Theorem 3-2. (POP-SORT) With the compare-and-mark₂ operation, any
merge-oriented sorting method can be used for duplicate-removal, union,
intersection, and difference.

How do we unify the operations that take a single multiset as operand
and the others that take two multisets? It requires some initial processing
on input operands. In Algorithm 3-1, execution of the database operations is
partitioned into three phases: initialization, primitive, and completion. The
input to the algorithm may be one or two multisets. The input conflict is
resolved in the initialization phase. Only in the completion phase may the
database operations invoke different constant-time post-sorting processing.
The output from the algorithm is that all the undesired data are marked off,
marked as either x⁻ or x⁺.

Algorithm 3-1: The bitonic POP-SORT.

INPUT: Data items from one or two multisets A and B.

OUTPUT: All the unmarked data items.

A. Initialization phase

1. Data items are arbitrarily labeled as A-elements for sorting,
duplicate-removal, and union.

2. For intersection and difference, A-elements and B-elements
are labeled differently in order to distinguish them
throughout the whole processing.

B. Primitive phase

1. Run the bitonic sort using the compare-and-mark₂ operation.

C. Completion phase

1. Duplicate-removal, sorting, and union do not need any
further processing.

2. For intersection and difference, the constant-time processing
in this phase is shown as a program segment in the following.

(* completion phase *)
for all i do  (* x_{n+1} = ∞ is a dummy *)
    compare x_i with x_{i+1}
    if both unmarked then
        case
            intersection: mark x_i⁻;
                          if not equal then mark x_{i+1}⁺;
            difference:   mark x_{i+1}⁺;
                          if equal then mark x_i⁻;


The relaxation that union, intersection, and difference take multisets as
operands of course relies on the versatility of POP-SORT. The practical
consideration is that multisets are artifacts of operations such as
projection and concatenation. Evidently, many query languages (SEQUEL, QUEL,
and QBE [Ullm80]) provide operators for working with multisets. On many
occasions in database query processing, duplicate-removal and union
(intersection, or difference) are executed in succession. For example,
projection is first requested before two relations are combined by union,
as in Π(R₁) ∪ Π(R₂), where Π denotes projection. In order to perform the
set operation ∪, duplicate tuples produced by the projection operation must
be removed. We have the following:


union(A, B) = rmdup(A) ∪ rmdup(B),
inter(A, B) = rmdup(A) ∩ rmdup(B),
differ(A, B) = rmdup(A) − rmdup(B).


With  the  relaxation  the  two  operations  are  combined  together  and  a  single 
run  of  sorting  is  enough.  However,  without  the  relaxation,  performing  the 
two  operations  sequentially  is  necessary.  The  sequential  execution  in  this 
case  may  imply  more  data  movement  and  programming  overhead. 

The result sequence could be sparse due to the marked-off duplicates.
The marked-off duplicates can be filtered out while outputting the sequence.
Alternatively, in some applications one might want to compress the
sequence internally so that the marked duplicates are squeezed out.
Schwartz presented an ingenious method to separate and pack marked data
on the Ultracomputer in O(log n) time [Schw80]. If the shuffle-exchange
interconnection is available the compression job can then be best done by
Schwartz's pack algorithm. A desirable solution might be running POP-SORT
again using another comparison function which treats the marked
duplicates as +∞.

3.2  Implementation  and  Performance 

Different interconnection patterns among processing elements for the
bitonic sort and their implementations have been reported in the literature
[Batc68, Ston71, Thom77, Nass79, Schw80, Prep81]. The local operation at
each processing element is the crucial part that may extend a
merge-oriented sorting algorithm to perform other database operations. We
shall consider only the implementation of the local operation in this
section.

An efficient implementation of the compare-and-mark₁ operation uses
one extra bit for marking. The mark bit, initially set to be 1, is appended to
each data item as the least significant bit. The operation works simply to
clear the least significant bit of one of the two elements whenever they are
found to be equal.

Similarly, compare-and-mark₂ can be implemented using two mark
bits, one for distinguishing A-elements from B-elements and the other for
marking duplicates. The mark bits are tagged to each data item as the two
least significant bits. Let a and b be ℓ-bit data items concatenated with the
two mark bits, a ∈ A and b ∈ B. Their binary representations are
(a_{ℓ-1}, a_{ℓ-2}, ..., a_1, a_0, a_{-1}, a_{-2}) and
(b_{ℓ-1}, b_{ℓ-2}, ..., b_1, b_0, b_{-1}, b_{-2}) respectively.
Initially, we have the mark bits set in such a manner that
(a_{-1}, a_{-2}) = (0,1) and (b_{-1}, b_{-2}) = (1,0). The compare-and-mark₂
function can be described as:


[Comparator module: inputs x, y; outputs min, max]

if x = y then x_{-2} ← x_{-1};

min ← min(x, y), max ← max(x, y);
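
The two-bit encoding can be tried out with a short sequential sketch. As
an assumption for illustration, data items are non-negative Python integers,
the two mark bits are packed as the two low-order bits of a word, and an
ordinary two-way merge sort stands in for the parallel bitonic network. The
helper names (encode, mark, merge, pop_sort, complete, set_op) are
hypothetical. The completion step shown here is one reading of the
constant-time post-processing: it keeps an unmarked A-element according to
whether its right neighbor is an unmarked B-element of the same value.

```python
A_UNMARKED, B_UNMARKED = 0b01, 0b10   # initial mark bits (a_-1, a_-2), (b_-1, b_-2)


def encode(values, mark_bits):
    """Append the two mark bits to each non-negative integer value."""
    return [(v << 2) | mark_bits for v in values]


def mark(word):
    """x_-2 <- x_-1: a (01) becomes a- (00); b (10) becomes b+ (11)."""
    return (word & ~1) | ((word >> 1) & 1)


def merge(left, right):
    """Merge two sorted word lists, applying compare-and-mark_2 to the heads."""
    left, right = list(left), list(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:           # equal, mark bits included
            left[i] = mark(left[i])
        if left[i] <= right[j]:           # integer order gives quasi-stability
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]


def pop_sort(words):
    if len(words) <= 1:
        return list(words)
    mid = len(words) // 2
    return merge(pop_sort(words[:mid]), pop_sort(words[mid:]))


def complete(sorted_words, op):
    """Completion phase for op = "intersection" or "difference" (A - B)."""
    kept = []
    for i, w in enumerate(sorted_words):
        if (w & 3) != A_UNMARKED:
            continue                       # only unmarked A-elements can survive
        partner = (w & ~3) | B_UNMARKED    # unmarked B-element of equal value
        paired = i + 1 < len(sorted_words) and sorted_words[i + 1] == partner
        if paired == (op == "intersection"):
            kept.append(w >> 2)
    return kept


def set_op(a_values, b_values, op):
    words = encode(a_values, A_UNMARKED) + encode(b_values, B_UNMARKED)
    return complete(pop_sort(words), op)
```

With A = [1, 2, 2, 3, 5] and B = [2, 3, 3, 4], set_op yields [2, 3] for
"intersection" and [1, 5] for "difference". Because the mark bits are the
low-order bits, plain integer comparison already places a⁻ before a, a
before b, and b before b⁺ for equal data values.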

The compare-and-mark₂ function can be interpreted more clearly using
a state diagram as shown in Figure 3-1. Define the state of a data item as the
value of its two mark bits. There are only four possible states, with (0,1) the
initial state for all the A-elements and (1,0) for B-elements. The rule of
marking-minus changes the state (0,1) to (0,0) for A-elements. Since the
marking is idempotent, once a data item reaches the (0,0) state it remains
in that state. The rule of idempotent marking-plus works in the same way for
B-elements.

Figure 3-1. State diagram for two idempotent
marking functions.


After the completion phase of POP-SORT, all the desired data items may
be in the state either (0,1) or (1,0). Suppose we arbitrarily choose (0,1) as
the final state of all the desired data items. To separate and pack all the
unmarked data using POP-SORT again, we need some more bit manipulation
capability.† First, reset the states (1,0) and (1,1) to (0,0). We then may
rotate each data item such that all the desired data have the most
significant bit 1. Alternatively, we may design the second mark bit with some
flexibility so that it may be programmably tagged to each data item as the
least or most significant bit.

Several adapted versions of the bitonic sort show that more data
routing time is required than comparison time. Suppose that a merge-oriented
sorting algorithm takes T₁(n)·t_R + T₂(n)·t_c time, where T₁(n) is the
number of data routing steps and T₂(n) the number of comparison steps.
The POP-SORT based on this sorting algorithm then requires
T₁(n)·t_R + T₂(n)·t'_c time. The only difference is the step size t'_c. That
is, the processing time for one local operation is changed. The marking
function, on one or two bits, of the local operation usually takes less time
than the comparison function (ℓ bits). The ratio of t'_c to t_c is bounded by
a small constant, actually close to one.

t'_c / t_c [...],

where ρ = [...] for bit serial design,

or ρ = [...] / log ℓ for bit parallel design.

† Unfortunately the bitonic sort is not stable. Otherwise, performing
sorting on the two mark bits would be able to separate marked and unmarked
data items.


In summary, the bitonic POP-SORT, based on Batcher's bitonic merge
sort, performs as well as the bitonic sort. It compares favorably with other
algorithms known for the five basic database operations (Table 2-1). The
bitonic POP-SORT outperforms Kung's and Song's duplicate-removal
algorithms dramatically. For the other operations, we do not sacrifice any
efficiency by using it. Since the bitonic POP-SORT serves as a primitive for
many operations, the overall system performance may improve
substantially (e.g. query embedding in Chapter 6). Program loading is no
longer necessary for every single operation. Data movement can be reduced
because data may stay longer for more processing.


3.3  POP-SORT,  in  General 

Batcher's bitonic merge sort has been shown easily adaptable to
become POP-SORT. The s²-way merge sort performs even better than
bitonic sort on a mesh-connected computer when the number of data items
is large [Thom77]. According to Theorem 3-2, we already have the first
order generalization that any merge-oriented sorting algorithm can employ
the compare-and-mark₂ operation to become POP-SORT. Of course s²-way
merge sort can be another base sorting algorithm for POP-SORT. However,
can we also adapt other sorting methods to become POP-SORT?

In this section, we shall show a general scheme to employ any sorting
algorithm as a POP-SORT. The general scheme again involves extending
some marking capability to a base sorting algorithm. In its most general
sense, POP-SORT thus presents an idea to adapt any sorting algorithm to
become an efficient primitive for many database operations.

The computation power of the basic operation compare-and-mark₂, in
addition to the simple comparison function, comes from enforcing the
ordering rules of quasi-stability, marking-minus, and marking-plus. A
sorting algorithm that is not merge-oriented might not be able to incorporate
all the ordering rules into the comparison function. Nevertheless, given a
sorted sequence of data items, a "shift-copy and compare" scheme, shown in
Figure 3-2, is able to detect and mark all the duplicates. If the linear
interconnection is available then the marking process requires only O(1)
time.
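
As a rough illustration of the "shift-copy and compare" idea, the sketch
below marks duplicates in an already sorted, quasi-stable sequence by
comparing each item with one neighbor only. The list-of-pairs
representation, the labels 'A' and 'B', and the function name
shift_copy_mark are assumptions for illustration. The loop is sequential
here, but each position needs only its own item and one neighbor, which is
why a single parallel step suffices on a linear interconnection.

```python
def shift_copy_mark(items):
    """Return one Boolean per item: True if the item is a marked duplicate.

    items: list of (value, label) pairs sorted so that, for equal values,
    'A' items precede 'B' items (the quasi-stability rule).
    """
    n = len(items)
    marked = [False] * n
    for i, (value, label) in enumerate(items):
        if label == 'A':
            # marking-minus: an A-item is marked when the copy shifted in
            # from the right is an equal A-item, so the last copy survives.
            marked[i] = i + 1 < n and items[i + 1] == (value, 'A')
        else:
            # marking-plus: a B-item is marked when the copy shifted in
            # from the left is an equal B-item, so the first copy survives.
            marked[i] = i > 0 and items[i - 1] == (value, 'B')
    return marked


seq = [(2, 'A'), (2, 'A'), (2, 'A'), (2, 'B'), (2, 'B'), (3, 'A')]
marks = shift_copy_mark(seq)
```

For the sequence above the marks follow the pattern a⁻ a⁻ a b b⁺, with the
single item of value 3 left unmarked.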


[Figure panels: marking-minus; marking-plus]

Figure 3-2. A "shift-copy and compare" scheme for
detecting duplicates in a sorted sequence.

Suppose that newSORT is a new and faster-than-ever parallel sorting
algorithm. Whether newSORT is merge-oriented or not, it can be adapted to
become POP-SORT according to the general scheme described in Algorithm
3-2. The general scheme is composed of four phases: initialization, sorting,
marking, and completion. A general POP-SORT is exactly the same as a
merge-oriented POP-SORT in the initialization and completion phases. For a
merge-oriented POP-SORT, the second and the third phases of a general
POP-SORT are combined together due to the reinforced computation power of
compare-and-mark₂.

Algorithm 3-2: A general POP-SORT.

INPUT: Data items from one or two multisets A and B.

OUTPUT: All the unmarked data items.

A. Initialization phase

1. Data items are arbitrarily labeled as A-elements for sorting,
duplicate-removal, and union.

2. For intersection and difference, A-elements and B-elements
are labeled differently in order to distinguish them
throughout the whole processing.

B. Sorting phase

1. Sort the data items, labeled as A-elements or B-elements,
according to the quasi-stability rule using newSORT.

C. Marking phase

1. If not performing sorting then continue.

2. Mark duplicates according to the rules of marking-minus and
marking-plus using the "shift-copy and compare" scheme.

D. Completion phase

1. If not performing duplicate-removal then continue.

2. Intersection and difference will invoke constant-time but
different processing as in Algorithm 3-1.




The theoretical lower bound of the time complexity of newSORT is
O(log n). The marking phase requires only O(1) time if linear
interconnection is provided. The first and the last phases also require only
constant processing time. Therefore the POP-SORT using newSORT as its base
also shares the same time complexity as newSORT. This even generalizes
Theorem 3-2: any sorting algorithm can be adapted to a four-phased
POP-SORT without introducing any significant overhead.

Similar  to  a  merge-oriented  POP-SORT,  an  efficient  implementation  for 
a  general  POP-SORT  needs  two  mark  bits.  One  of  the  mark  bits  is  used  for 
distinguishing  two  multisets,  and  the  other  is  for  marking  duplicates.  In  a 
general  POP-SORT  the  quasi-stability,  marking-minus,  and  marking-plus 
rules  are  still  enforced  using  the  two  mark  bits.  The  bit  manipulation  capa¬ 
bility  needed  in  a  general  POP-SORT  is  thus  no  less  than  that  in  a  merge- 
oriented  one. 

3.4  Application  to  Join  Operations 

The number of result tuples after joining two relations A and B gives
a lower bound on the total computing work needed for join. Assuming each
relation is of size n for simplicity, this number may, in rare cases, become
as large as O(n²). Using O(n) processing elements, Kung's [Kung80] and Song's
[Song81] linear time algorithms are optimal in the sense of handling the
worst case. For most situations, the result relation has many fewer tuples.
An A-tuple may have to join with only some B-tuples. By applying POP-SORT to
precondition the relations, a join system shown in this section can perform
the natural join and equi-join operations in sublinear time.


Any sorting algorithm can bring together all the elements of the same
value. The groups of elements of the same values are called aggregates. We
first sort the relations over the joining attributes using POP-SORT. The
primitive operation is quasi-stable. It produces aggregates as well as
insists that all the A-tuples precede B-tuples in each aggregate. We then can
perform natural join and equi-join simply by shifting all the B-tuples in one
direction to join with A-tuples. This process is called "easy-catch".


[Figure labels: controller; output result tuples]

Figure 3-3. Logical structure of the easy-catch
system for performing join operations.


Define d as the longest distance that a B-tuple needs to shift in order
to catch all the joinable A-tuples. For easy-catch d is the largest size of the
aggregates. To reach the goal of having sublinear time performance the
catching process should be terminated after d shift steps. Unfortunately d is
usually not known beforehand. In Figure 3-3 we show a solution to halting the
catching process by superimposing a tree interconnection on top of the
processing elements. A halting controller located at the root of the tree
interconnection supervises all the processing elements. The tree
interconnection provides the communication paths between the controller and
the processing elements. Each processing element is responsible for reporting
its activity by sending a "busy" or "idle" message up to the controller. The
controller will broadcast the "halt" message when it decides all the
processing elements are idle. If a halting message is received, the
processing elements stop.


The programming of the join system is extremely simple. All the
processing elements execute the same program and the program is nothing but
a loop following some initialization. Suppose there are two registers, a
and b, capable of holding A and B tuples† in each processing element. The
processing elements execute the looping program as follows:


for all i do
    (* initialization *)
    a_i, b_i ← nil;
    if A-tuple then load a_i else load b_i;

    (* easy-catch: shift and join *)
    repeat forever
        receive(msg);
        if msg = "halt" then stop;
        shift;  (* b_i ← b_{i+1} *)
        if a_i match b_i then {perform join; send("busy")}
        else send("idle");
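
The looping program can be simulated sequentially to see how the shifting
and halting interact. The sketch below is an illustration under several
assumptions: tuples are (join_key, label, tuple_id) triples, Python's
sorted() stands in for the quasi-stable POP-SORT preconditioning, and the
controller's O(log n) detection and broadcast delays are ignored, so the
loop simply stops after the first step in which every processing element is
idle. The name easy_catch is hypothetical.

```python
def easy_catch(a_tuples, b_tuples):
    """Simulate easy-catch; return (joined id pairs, number of shift steps)."""
    # Preconditioning: sort on the join key with A-tuples before B-tuples
    # inside each aggregate ('A' < 'B' gives the quasi-stable order).
    seq = sorted(a_tuples + b_tuples)
    n = len(seq)
    a = [t if t[1] == 'A' else None for t in seq]   # register a_i
    b = [t if t[1] == 'B' else None for t in seq]   # register b_i
    result, steps = [], 0
    while True:
        b = b[1:] + [None]                 # shift: b_i <- b_{i+1}
        steps += 1
        busy = False
        for i in range(n):                 # all PEs act in the same step
            if a[i] is not None and b[i] is not None and a[i][0] == b[i][0]:
                result.append((a[i][2], b[i][2]))   # perform join
                busy = True
        if not busy:                       # every PE reported "idle": halt
            break
    return result, steps


A = [(1, 'A', 'a1'), (2, 'A', 'a2'), (2, 'A', 'a3')]
B = [(2, 'B', 'b1'), (2, 'B', 'b2'), (3, 'B', 'b3')]
pairs, steps = easy_catch(A, B)
```

In this example the only aggregate with both A-tuples and B-tuples has join
key 2 and size 4. The simulation produces the four pairs of that aggregate
and stops after 4 shift steps, the last of which is the all-idle step that
triggers the halt.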


The controller detects that all the processing elements are idle after a
log n time delay. Another log n time delay is necessary for broadcasting the
"halt" message to all the processing elements. The time for performing the
natural join and equi-join is thus the total time for POP-SORT, easy-catch,
and the halting delay.


T = T(POP-SORT) + O(d) + O(log n), where d < n.
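
The shift-and-join loop and the halting rule can be illustrated with a small
sequential simulation. This is only a sketch under stated assumptions: the
function name easy_catch and the (key, relation, tuple_id) layout are
hypothetical, the input is assumed to be already sorted with A-tuples ahead
of B-tuples of the same key, and the O(log n) delay of the tree is ignored
(the controller is modelled as an immediate "all idle" test).

```python
def easy_catch(sorted_tuples):
    """Simulate easy-catch on a linear array of processing elements.

    sorted_tuples: list of (key, relation, tuple_id), sorted by key, with
    relation 'A' placed before relation 'B' among equal keys (assumption).
    Returns the joined (A id, B id) pairs and the number of shift steps.
    """
    n = len(sorted_tuples)
    # Initialization: each element loads its tuple into register a or b.
    a = [t if t[1] == "A" else None for t in sorted_tuples]
    b = [t if t[1] == "B" else None for t in sorted_tuples]
    results, steps = [], 0
    while True:
        b = b[1:] + [None]          # shift: b_i <- b_{i+1}
        steps += 1
        busy = False
        for i in range(n):
            if a[i] is not None and b[i] is not None and a[i][0] == b[i][0]:
                results.append((a[i][2], b[i][2]))   # perform join
                busy = True                          # element reports "busy"
        if not busy:                # all elements idle: controller halts
            return results, steps


tuples = sorted([
    (5, "A", "a1"), (5, "A", "a2"), (5, "B", "b1"),
    (7, "A", "a3"), (7, "B", "b2"), (7, "B", "b3"),
    (9, "B", "b4"),
])
pairs, steps = easy_catch(tuples)
print(sorted(pairs), steps)
```

In this example the loop stops one step after the last match, so the number
of shift steps tracks d rather than n.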


† The tuple may consist only of the tuple-id and the values of the joining attributes.




Since POP-SORT needs only sublinear time, the join operations can be done
in sublinear time as long as d is sublinear in n, i.e. d = o(n). If
d = O(√n), then the join operations can be done in O(√n) time using the
bitonic POP-SORT.


The CHiP computers are good candidates for implementing the join system.
Suppose that data items from both A and B are sorted by POP-SORT in a
quasi-stable fashion into snake-like row-major order (see Chapter 5). Two
co-existing configurations shown in Figure 3-4 are feasible if there is a
cross-over capability on switches. We assume that fan-in on switches
behaves like a logic "AND", and that switches also have fan-out capability
to perform broadcasting. The linear and tree interconnections for the join
system are hence provided by the two configurations.




Figure  3-4.  Two  configurations  on  a  CHiP  computer 
for  implementing  the  easy-catch  system. 


For the purpose of area-economy, the join system is implemented as
above in a square CHiP region. Unfortunately, only perimeter processing
elements have I/O ports to the peripheral storage devices. There would be a
problem of non-uniform distribution of result tuples since they would
accumulate at some PEs. We call this the hot spots problem.


If there is enough memory space in processing elements, the hot spots
problem does not do any harm as long as the result relation is to be dumped
out of the CHiP processor. In some cases, the result relation is to be
processed further (see query embedding in Chapter 6). Then the hot spots
problem can be solved by the Sprinkle Algorithm as shown in Appendix B. The
Sprinkle Algorithm employs the same communication scheme as a single
stage of the bitonic merge. Let k be the maximum number of result tuples
at hot spots. The Sprinkle Algorithm requires O((k/[illegible]) · √n) time
using mesh interconnection. The algorithm works especially well when k has
small values.

This join system can perform other join operations too. The le-join and
ge-join can be implemented in exactly the same way as natural join and
equi-join, except that d is no longer the largest size of the aggregates. To
perform ne-join, we need "two-way-catch", shifting B tuples in both
directions to join with A tuples. The join system is especially suitable for
natural join and equi-join because the value of d is more likely to be small
for these two types of join operations.

In summary, the join system in Figure 3-4 provides adaptive performance
for join operations. "Easy" joins, which require B-tuples to join with only
a limited number of A-tuples, are suitable for easy-catch implementation.
They can be done with much better performance by avoiding executing
them as "difficult" joins.


CHAPTER  4 


OPTIMALITY  OF  THE  PRIMITIVE  OPERATION 


The order of data items often has a profound influence on the speed and
simplicity of algorithms which manipulate them [Knut73]. As a consequence,
sorting has been found to be very useful as a pre-processing step for
a wide variety of applications. It is well known that a considerable portion
of the computer running time was and still is spent on sorting.

Although sorting is useful, in some cases it is overused. For example,
selection of the median of n data items requires only O(n) comparisons,
although the more expensive sorting is a common way to solve it. Moreover,
sorting is completely useless in some other cases. Researchers found that
the benefit of data ordering gives way to the computing power of
parallel hardware on the searching problems (insertion, deletion, and
update) [Bent79]. Despite these observations, the usefulness of sorting
might be underestimated in the context of parallel computation.

While the usefulness of sorting might be over-emphasized in the sequential
case, the feasibility of applying sorting in the parallel case needs more
careful exploration. POP-SORT presents a mechanism to extend sorting to
performing many other database operations. A methodology for applying
parallel sorting to the solution of other problems is thus demonstrated. In
order that POP-SORT be an optimal primitive, parallel sorting must be an
optimal way to implement those database operations. However, is parallel
sorting an optimal way of performing those database operations?

In this chapter we shall investigate the optimality of the primitive
operation POP-SORT. We show how the reducibility of sorting to
duplicate-removal plays a crucial role in determining the optimality. We
then concentrate on studying the reducibility of sorting to
duplicate-removal. Two comparison functions are considered: the strong
comparison (<, =, >) and the weak comparison (=, ≠). We prove the
reducibility for all the computations based on the weak comparison function.
We also prove the reducibility for a subclass of computations based on the
strong comparison function.

Section 4.1 establishes a time-complexity hierarchy representing the
reducibility relationships among POP-SORT and the other five database
operations. These relationships show that the hierarchy would collapse if
sorting were reducible to duplicate-removal. A collapsed hierarchy implies
the optimality of POP-SORT. The important relationship between sorting and
duplicate-removal is then studied. A special model of parallel computation
suitable for our study and two types of comparison functions are discussed
in Section 4.2. In Sections 4.3 and 4.4, we investigate the reducibility of
sorting to duplicate-removal on the computation model with the two
comparison functions respectively.

4.1  Collapsing  the  Complexity  Hierarchy 

By enforcing some extra ordering rules, any sorting algorithm can be
extended to become POP-SORT without any significant overhead. POP-SORT
serves as a primitive operation for sorting, duplicate-removal, union,
intersection, and difference. The bitonic POP-SORT, an instance of the
primitive operation, improves the upper bound for duplicate-removal over the
algorithms in [Kung80] and [Song81]. Also, the fastest algorithms known for
union, intersection, and difference apply sorting as a pre-processing step
[Schw80]. Therefore POP-SORT does not sacrifice any efficiency for unifying
these operations.

However, is POP-SORT an optimal primitive for performing these five
database operations? To evaluate the optimality of the primitive operation,
we investigate the complexity relationships between it and the five
operations. The relationships are measured in terms of reducibility. Let P_1
and P_2 be two problems, and ψ_i be any algorithm for solving the problem
P_i. The problem P_2 is said to be reducible to P_1 iff there is an
algorithm ψ_2 which applies ψ_1 to solve P_2. We are most interested in the
case when both algorithms have time complexities of the same order, i.e.
O(T(ψ_1)) = O(T(ψ_2)).

Some important reducibility relationships are summarized in the following:

• All the five operations are reducible to POP-SORT. Chapter 3 presents
POP-SORT as a primitive operation which can perform sorting,
duplicate-removal, union, intersection, and difference.

• POP-SORT is reducible to sort. A "shift-copy and compare" scheme is
shown in Chapter 3 to perform the marking-minus and marking-plus
functions. A general mechanism based on the scheme is presented to
adapt any sorting algorithm to POP-SORT. The "shift-copy and compare"
scheme takes only constant time. The adaptation overhead is thus
negligible.

• Duplicate-removal is reducible to union, intersection, and difference.
The operations union, intersection, and difference are allowed to take
multisets as operands. Duplicate-removal thus can be implemented as:
rmdup(A) = union(A, φ) = inter(A, A) = differ(A, φ), where φ is the
empty set.
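
The identities in the last item can be checked on a small multiset. The
sketch below is illustrative only: sort_and_mark is a hypothetical stand-in
that sorts sequentially and marks an item equal to its left neighbour (in
the spirit of "shift-copy and compare"), and union, inter, and differ are
assumed to return sets even when their operands are multisets.

```python
def sort_and_mark(items):
    """Sort, then mark every item that equals its left neighbour."""
    ordered = sorted(items)
    marks = [i > 0 and ordered[i] == ordered[i - 1] for i in range(len(ordered))]
    return ordered, marks


def rmdup(a):
    """Duplicate-removal: keep only the unmarked items."""
    ordered, marks = sort_and_mark(a)
    return [x for x, marked in zip(ordered, marks) if not marked]


def union(a, b):
    return rmdup(list(a) + list(b))


def inter(a, b):
    return rmdup([x for x in a if x in b])


def differ(a, b):
    return rmdup([x for x in a if x not in b])


A = [3, 1, 3, 2, 1]          # a multiset with duplicates
EMPTY = []
print(rmdup(A), union(A, EMPTY), inter(A, A), differ(A, EMPTY))
```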


[Figure 4-1 diagram: sorting, remove-duplicates, POP-SORT, union,
intersection, and difference, linked by "is reducible to" arrows.]

Figure  4-1.  Collapsing  the  time  complexity  hierarchy 
implying  the  optimality  of  POP-SORT. 


The above reducibility relationships are also depicted as a time complexity
hierarchy in Figure 4-1. The arrow in the figure denotes the relationship
"is reducible to". To collapse the complexity hierarchy would imply the
optimality of POP-SORT. The relationship represented by the dotted arrow
"....>" therefore plays an important role in collapsing the complexity
hierarchy. For POP-SORT to be an optimal primitive, sorting must be
an optimal way to perform duplicate-removal. The key to unifying the five
operations by POP-SORT is the extension of sorting to mark off duplicate
items. Hence it is no surprise that the optimality of POP-SORT relies on
the optimality of sorting to perform duplicate-removal.

Muller and Preparata [Mull75] showed a constructive switching network
of O(log n) depth which performs sorting. The switching network is an
implementation of the enumeration comparison method, in which each data item
is compared with every other one. This is evidence that the benefit of
parallel hardware supersedes that of data ordering. The switching network
can be used to implement POP-SORT, achieving the theoretical lower time
bound Ω(log n). This is actually an immediate proof that POP-SORT based on
Muller and Preparata's network is optimal. It is also a proof that sorting
is reducible to duplicate-removal. However, the switching network requires
O(n²) comparators and switches. In the following sections we investigate
further the reducibility of sorting to duplicate-removal in the context of
fewer processing components.

4.2  Comparison  Functions  and  Computation  Models 

This section discusses comparison-based computation on parallel
machines. We point out that there are two types of comparison functions
that must be considered. We also present a universal model of parallel
machines to facilitate our study of the reducibility of sorting to
duplicate-removal.

Comparison between two elements is a primitive instruction for both
sorting and duplicate-removal. According to the law of trichotomy, exactly
one of the possibilities x < y, x = y, x > y is true. However, circuit level
implementations of the pairwise comparison can provide this information in
one of the following four ways: (1) <, =, >; (2) ≤, >; (3) <, ≥; and
(4) =, ≠. They all involve different switching logic functions. The first
three are the strong comparison functions, which can be shown to be
equivalently powerful.† The last one, called the weak comparison function,
is not adequate for sorting though it is for duplicate-removal.
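
The equivalence among the strong comparison functions can be illustrated for
the (≤, >) form: two such comparisons recover the full (<, =, >) outcome,
whereas the weak comparison alone never yields an ordering. The function
names below are hypothetical and serve only this illustration.

```python
def le_gt(x, y):
    """Strong comparison of type (<=, >): True means x <= y, False means x > y."""
    return x <= y


def three_way(x, y):
    """Build one (<, =, >) comparison from two (<=, >) comparisons."""
    forward = le_gt(x, y)     # x <= y ?
    backward = le_gt(y, x)    # y <= x ?
    if forward and backward:
        return "="
    return "<" if forward else ">"


def weak(x, y):
    """Weak comparison (=, !=): reports equality only, no ordering."""
    return x == y


print(three_way(1, 2), three_way(2, 2), three_way(3, 2), weak(1, 2))
```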

A sorting algorithm should use one of the strong comparison functions
in order to come out with a total ordering. For a duplicate-removal
algorithm, it is not necessary to assess any ordering information. It may
use the data ordering to some extent, or it may completely ignore the data
ordering. That is, duplicate-removal algorithms may use the weak comparison
alone, the strong comparison alone, or a mixture of both comparison
functions.

A variety of models of parallel computation have been proposed. They
may be grouped into two classes: shared memory machines and fixed
connection networks [Prep81, Boro82]. The former class assumes a large
random access memory shared by all the processing elements or an equivalent
system (see examples in [Fort78, Gold78, Lev81]). The latter assumes a fixed
interconnection among processing elements, or between processing elements
and memory modules (see examples in [Brow80, Schw80, Prep81]).

In terms of the restrictions on accessing memory modules, shared
memory machines may be classified into three categories: concurrent read
or write, concurrent read but exclusive write, and exclusive read or write.
Execution time on shared memory machines is usually measured as the number
of operation steps performed, assuming that the memory access time is
free. This type of computation model overlooks technological feasibility.
While shared memory machines are suitable for deriving lower time bounds,
they are not appropriate for studying data movement realistically.

† Two (≤, >) or (<, ≥) comparisons are equivalent to one (<, =, >) comparison.

For current hardware technologies, fixed connection networks are more
reasonable. However, a single interconnection cannot provide optimal hosts
for all the important algorithms. Furthermore, many problems require only
infrequent and irregular processor communication. Fixed connection
networks are too restricted to study the reducibility relationships between
sorting and duplicate-removal.


Figure  4-2.  PIM  machine  as  a  model  of
parallel  computation. 


In order to study sorting and duplicate-removal on a general base, we
need a universal model of parallel machines. The universal model must be
able to represent each specific machine model and be suitable for studying
data ordering and data movement. For these purposes, we present a
computation model called the PIM machine, shown in Figure 4-2.

The PIM machine has three components: a group of processing elements,
an interconnection network, and a collection of memory modules
(which may be as small as single memory words). Separate memory modules
enable us to "observe" data items being processed. We assume that the
interconnection network has all the flexibility and power which enables the
PIM machine to emulate any parallel machine.

The interconnection network provides communication paths between
the processing elements and the memory modules. At one extreme, we may
assume that the interconnection network is so powerful that the PIM
machine behaves like a shared memory machine. At or near the other
extreme, we may assume that the interconnection network provides fixed
communication paths as simple as those for the linear array connection.
For emulating reconfigurable computers, the interconnection network has
the reconfigurability to provide different interconnection patterns.

Communication overhead is important on parallel machines, especially
when the interconnection network becomes less powerful. On the PIM
machine, the time complexity is measured by taking both comparison count
and data movement steps into account. Data communication time may be
absorbed by providing feasible interconnections between processing
elements and memory modules. Transmission time is assumed independent of
the lengths of communication paths; the propagation delay problem is not
an issue here. For example, sorting needs O(√n) data routing steps and
O(log² n) comparison steps using the mesh interconnection [Thom77,
Nass79], or O(log² n) routing and comparison steps using the
shuffle-exchange interconnection [Ston71].


4.3  On  Enumeration  Comparison 

Based on the weak comparison function, duplicate-removal requires
½n(n-1) comparisons since every pair of data items must be compared
directly. Taking advantage of data ordering, or using the strong comparison
functions, the total comparison count may be reduced. However, the total
processing time is not necessarily decreased because the time complexity is
measured as the sum of parallel comparison steps and parallel data
movement steps. The absolute requirement of the ½n(n-1) weak comparisons
therefore does not exclude the possibility of a fast parallel algorithm for
duplicate-removal.

In this section, we shall prove that sorting is reducible to any
duplicate-removal algorithm that is based on the weak comparison function.
This is not unreasonable because enumeration comparison methods have
been proposed for sorting [Knu73, Mull75, Prep78] in which each data item is
compared with every one of the others. Naturally, sorting requires the
application of one of the strong comparison functions.

Let ψ_1 be a duplicate-removal algorithm using the weak comparison
function. The execution of the algorithm may be functionally partitioned
into two stages: (1) performing enumeration comparisons, and (2) determining
mark bits (assuming there is a mark bit corresponding to each data item).
The algorithm thus may be visualized as making the weak comparisons to fill
up a triangular table (upper triangular bit matrix) and figuring out
the mark bits as shown in Figure 4-3.

Notice that the mark bits are obtained by "ORing" all the bits on each
row. The following program segment describes the abstract function of the
algorithm ψ_1. It is not required that ψ_1 be actually executed this way.


(* i, j : indices; M : matrix *)

(* perform enumeration comparisons *)
for all i < j do
    if x_i = x_j then M[i,j] := 1 else M[i,j] := 0;

(* determine mark bits *)
for all i
    m_i := OR over all j of M[i,j];


[Figure 4-3 diagram: an upper triangular table with rows x_0, ..., x_7 and
columns x_1, ..., x_7; the bits in row i feed the mark bit m_i.]

Figure  4-3.  The  function  of  enumeration  comparison 
methods:  table  filling  and  row  computation. 
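
A sequential rendering of the table-filling view may help fix the idea. It is
only a sketch of the abstract function, not of any parallel schedule; the
name mark_bits is hypothetical, and the nested loops stand in for
comparisons that a parallel machine would perform concurrently.

```python
def mark_bits(x):
    """Fill the upper triangular table with weak comparisons, then OR each row."""
    n = len(x)
    M = [[0] * n for _ in range(n)]
    # Enumeration comparisons: one weak comparison per pair i < j.
    for i in range(n):
        for j in range(i + 1, n):
            M[i][j] = 1 if x[i] == x[j] else 0
    # Mark bit m_i is the OR of row i: x_i is marked if an equal item follows it.
    return [int(any(M[i][j] for j in range(i + 1, n))) for i in range(n)]


x = [4, 7, 4, 1, 7, 4]
m = mark_bits(x)
survivors = [v for v, bit in zip(x, m) if bit == 0]
print(m, survivors)
```

With this convention the last occurrence of each value is the one left
unmarked.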


Now perform the following procedure to modify the algorithm ψ_1:


1. Substitute the weak comparison function with the strong comparison
function (≤, >).

2. Fill up the whole matrix rather than just the upper triangular
half by making two entries in the matrix for each comparison
performed.

3. Substitute the "OR" operation with a "SUM" operation.


The abstract function of the new algorithm, say ψ_2, may be described as the
following program segment:

(* perform enumeration comparisons *)
for all i < j do
    if x_i > x_j then {M[i,j] := 1; M[j,i] := 0}
    else {M[i,j] := 0; M[j,i] := 1};

(* determine unique ranks *)
for all i
    r_i := SUM over all j ≠ i of M[i,j];

The two algorithms, ψ_1 and ψ_2, are not necessarily implemented in two
clearly separated stages as described in the program segments. The program
segments, however, manifest the required computation that must be
done by the algorithms. No matter how the two functional stages of ψ_1 are
actually executed on PIM machines, ψ_2 is executed in the same way. The
algorithm ψ_2 would compute unique ranks for all the data items in spite of
duplicates. Both algorithms share exactly the same time complexity,
assuming all the operations take unit time. We have thus proved the
following Lemma.

Lemma 4-1. From any duplicate-removal algorithm based on the weak
comparison function, we can find an algorithm to compute unique ranks for
all the data items in the same time.

Let x_0, x_1, ..., x_{n-1} be a sequence of data items, where
x_i ∈ [0, m-1] and m >> n. The sequence can be transformed into a sequence
of unique ranks r_0, r_1, ..., r_{n-1} (where r_i is the unique rank of x_i)
by the algorithm ψ_2. Sorting the sequence of unique ranks is much easier
than sorting the sequence of original data items. Thus, a two-phased sorting
scheme is indicated. It first determines unique ranks and then redistributes
data items according to their unique ranks.
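
The two-phased scheme can be sketched sequentially as follows. The function
names are hypothetical, the loops again stand in for parallel steps, and
ties are broken so that of two equal items the later one receives the higher
rank, which is one way to realise the (≤, >) comparison on pairs i < j.

```python
def unique_ranks(x):
    """Phase 1: fill the whole matrix with strong comparisons and SUM each row."""
    n = len(x)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] > x[j]:
                M[i][j], M[j][i] = 1, 0
            else:                      # x_i <= x_j
                M[i][j], M[j][i] = 0, 1
    return [sum(row) for row in M]     # diagonal entries stay 0


def two_phase_sort(x):
    """Phase 2: redistribute each item to the position given by its rank."""
    ranks = unique_ranks(x)
    out = [None] * len(x)
    for item, rank in zip(x, ranks):
        out[rank] = item
    return out


x = [4, 7, 4, 1, 7, 4]
print(unique_ranks(x), two_phase_sort(x))
```

Because the ranks form a permutation of 0, ..., n-1 even when duplicates are
present, the redistribution never writes two items to the same position.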


Data redistribution given the unique ranks can be done in O(log n) time
with the switching network in [Mull75]. With the assumption of a shared
memory, it takes only constant time [Prep78]. For a problem that has one
of its outputs determined by all the n inputs, the theoretical lower time
bound is Ω(log n). Removing duplicates or determining unique ranks
therefore requires no less time than redistributing data items. Hence we
have proved the following theorem.

Theorem  4-1.  Sorting  is  reducible  to  any  duplicate-removal  algorithm  that 
is  based  on  the  weak  comparison  function. 

4.4  On  Establishing  Total  Orderings 

In this section we investigate whether sorting is reducible to
duplicate-removal based on the strong comparison function (<, =, >).
Although the weak comparison function is adequate, duplicate-removal
algorithms using the strong comparison function take advantage of data
ordering. To show the reducibility, we need to prove that the information of
data ordering collected by duplicate-removal algorithms can be easily
transformed to an explicit total ordering as produced by sorting.


We first define a semi-digraph to represent the minimum set of comparisons
required for duplicate-removal. By showing that the semi-digraph
must contain a total ordering, we prove that the comparisons required for
duplicate-removal are also adequate for sorting. While this is enough to
show the sequential reducibility of sorting to duplicate-removal, it is not
sufficient for the parallel reducibility. We therefore pursue the matter
further and show that the parallel reducibility is true at least for a useful
type of homogeneous computation.

Let X = {x_0, x_1, ..., x_{n-1}} be a multiset consisting of n elements
from a totally ordered set. Define C_m as the minimum set of comparisons
required for the elimination of duplicates. A semi-digraph which contains
both directed and undirected edges can represent the set C_m:

x_i •——>• x_j if x_i > x_j, or
x_i •———• x_j if x_i = x_j.

The semi-digraph is composed of n vertices and no more than ½n(n-1)
edges. For a path between a pair of vertices x_i and x_j, the path is
undirected if it contains only undirected edges, or the path is directed if
it contains at least one directed edge. In the semi-digraph, directed paths
denote the ordering relationship "is greater than" or "is less than", and
undirected paths denote the relationship "is equal to".

To guarantee that all the duplicates are found, there must exist a path
between any pair of vertices x_i and x_j, i ≠ j. Otherwise the ordering
relationship between them is not known. The path is either undirected or
one-way directed. The graph is conflict free because of the uniqueness of
the ordering relationship between any vertex pair. In other words, exactly
one of the possibilities x_i < x_j, x_i = x_j, x_i > x_j is represented in
the graph for each pair of vertices. The semi-digraph should contain a
subgraph equivalent to that shown in Figure 4-4. Lemma 4-2 is thus proved.




Figure  4-4.  The  total  ordering  contained 
in  the  semi-digraph. 


By Lemma 4-2, elimination of duplicates always needs those comparisons
which are sufficient to come out with a total ordering. Elimination of
duplicates must then have done the comparisons required for sorting. This
is enough to show the sequential reducibility of sorting to duplicate-removal,
since the sequential time complexity can be reflected by the comparison
count alone.
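
The argument can be made concrete by checking whether a given set of
comparison outcomes already pins down a total ordering. The sketch below is
an illustration under assumptions: the function name and the (i, j, rel)
encoding of outcomes are hypothetical, the outcomes are assumed conflict
free, equal items are merged with a union-find structure, and the classes
are ordered with graphlib.TopologicalSorter from the Python standard library
(Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def total_order(n, comparisons):
    """Return the classes of equal items in descending order, or None.

    comparisons: iterable of (i, j, rel), rel in {">", "="}, meaning x_i rel x_j.
    None means some pair of items is left unrelated by the given outcomes.
    """
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Undirected edges: merge items known to be equal.
    for i, j, rel in comparisons:
        if rel == "=":
            parent[find(i)] = find(j)

    # Directed edges between classes: greater class -> set of smaller classes.
    smaller = {find(v): set() for v in range(n)}
    for i, j, rel in comparisons:
        if rel == ">":
            smaller[find(i)].add(find(j))

    # Smaller classes act as predecessors, so static_order() is ascending.
    descending = list(TopologicalSorter(smaller).static_order())[::-1]
    # A total ordering exists iff consecutive classes are directly compared.
    for hi, lo in zip(descending, descending[1:]):
        if lo not in smaller[hi]:
            return None
    return [sorted(v for v in range(n) if find(v) == c) for c in descending]


# Items x = [3, 5, 3, 8]: outcomes x_3 > x_1, x_1 > x_0, x_0 = x_2.
print(total_order(4, [(3, 1, ">"), (1, 0, ">"), (0, 2, "=")]))
# Dropping x_1 > x_0 leaves 5 and 3 unrelated, so duplicates cannot be ruled out.
print(total_order(4, [(3, 1, ">"), (0, 2, "=")]))
```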

On PIM machines, data communication time is important. Although
duplicate-removal requires at least the same comparison work as that for
sorting, it does not require that data items be arranged in any particular
order. To arrange data items in order may entail more data movement time
than the total processing time for duplicate-removal. Without further study,
it is not possible to say that sorting is reducible to duplicate-removal.
Nevertheless, the existence of the total ordering is guaranteed after
performing any duplicate-removal algorithm.


We shall prove the parallel reducibility for a special type of parallel
computation that insists on a "homogeneous sequence of execution"
[Knu73, p. 220]. Whenever we compare x_i with x_j, the subsequent execution
for the case x_i < x_j is exactly the same as for the case x_i > x_j, except
with the data values interchanged. This type of computation is widely
applied in practical parallel computation since the decision structure is
extremely simple. In the following, we first define general comparison
networks to simulate the execution of duplicate-removal algorithms
on PIM machines. We then derive versatile comparison networks with fixed
connections which are able to sort and identify duplicates.

Comparison  Network  Model 

Execution of algorithms on PIM machines can be traced by recording
activities at processing elements and value changes at memory locations.
For comparison based computation, processing elements primarily perform
comparison and data movement. To study data ordering, we focus on
observing the memory part and further impose time-variant ordering
relationships (<, =, >) among different memory locations. Concurrent writes
to a memory location are prohibited lest the ordering information should be
disrupted. A comparison network is thus presented to simulate one execution
of a comparison-based algorithm on PIM machines.

The execution of a comparison-based algorithm on an input permutation
can be recorded as a sequence of comparison steps and data movement
steps. One can visualize the execution as applying processing elements to
memory locations as many times as the number of operation steps. A
comparison network has four important parameters:

n - problem size or the number of data items,
m - storage capacity or the maximum number of
    data copies at any instant,
t - depth of network or the number of parallel
    comparison/routing steps,
k - degree of parallelism or the largest number
    of comparisons that can be performed at a
    parallel comparison step.


Figure  4-5.  A  general  comparison  network. 

As shown in Figure 4-5, a comparison network consists of two types of
components: loci (in circles) and comparators (in squares). A "locus" is a
memory location capable of holding one data copy. There are t instances of
the m loci in the network in total. The data value at a locus may (1)
retain its previous value, (2) copy from another locus, or (3) receive a value
from a comparator. Thus, a data value may fan out to have multiple copies
(concurrent reads), but fan-in of many data values is undefined (exclusive
write). A comparator reads two data values from its source loci, compares
them, and writes them out to its object loci. We assume, for generality,
that the order of the two outputs (inputs) of a comparator is not important.
A comparator may route the two data in arbitrary order to its object loci.

The I/O of the comparison network is where-oblivious [Lipt81]. The first
n loci are arbitrarily defined as output loci. In the very beginning, n (input
loci) out of the m loci have the n data items. After t comparison and data
routing steps, the n data items are in the output loci with all the duplicates
marked off.

General comparison networks allow broadcasting and multiple copies of
data items. A special example of the comparison networks is called the
rw-conflict-free network. Rw-conflict-free networks do not allow fan-out of
data values, and therefore do not have any memory conflicts, neither read
conflicts nor write conflicts. The sorting network in [Knu73, p. 220], or
network of comparator modules, is a restricted form of the rw-conflict-free
network. Sorting networks have exactly n data copies (i.e. n = m) and
strictly route data items in a predefined way. For each comparator the
source loci and object loci are the same in sorting networks.

Versatile  Network  N_s

A duplicate-removal algorithm does not have to arrange data items in
any particular order. However, our approach is, for any duplicate-removal
algorithm ψ, to derive a "compatible" algorithm ψ_s that is able to sort as
well as to remove duplicates. The derived algorithm does not increase the
time complexity, whereas it would move data items in such a manner as to
come out with an explicit total ordering. This approach originates from the
potential existence of a total ordering described in Lemma 4-2.


Suppose that the execution of ψ on a given input permutation is
recorded as a comparison network N. A restricted form of the comparison
network, N_r, can be derived from N such that N_r emulates N in the same
comparison and data routing steps. In the network N_r the comparators are
restricted in the sense that the larger inputs are always routed to the upper
object loci. (Notice that N_r shares the four parameters with N, but enforces
data routing differently.)

Define $O_i$ as the ordering relationship among the m loci at the i-th
stage in the network $N_r$. The initial relationship $O_0$ contains no information.
The restricted network $N_r$ has rigid interconnection and rigid data routing.
By induction, the ordering relationships $O_0, O_1, \ldots, O_t$ are all known. The
ordering relationship $O_t$ must contain a total ordering according to Lemma 4-2.
We can then rearrange the first n loci in $N_r$ according to the ranks
determined by the total ordering $O_t$ and obtain a new network $N_s$. Hence
data items are sorted in descending order in $N_s$. Based on this observation,
we prove the following theorem.

Theorem 4-2. Given any algorithm $\psi$ for duplicate-removal which performs a
homogeneous sequence of execution on exclusive-write PIM machines, there
exists a compatible algorithm $\psi_s$ which runs as fast and is able to sort.

[Proof] Input permutations are not relevant to the execution
sequence because of the homogeneity of execution. Therefore there
is a single comparison network $N$ which simulates the execution of $\psi$
on any input permutation. Following the procedures mentioned
before, the network $N$ can be transformed into $N_r$ and then into $N_s$
without changing the network depth. The algorithm $\psi_s$, corresponding
to the network $N_s$, can then perform duplicate-removal and sorting
in the same time as $\psi$ can perform duplicate-removal. ■

Without assuming the homogeneous sequence of computation we may
have different comparison networks with different depths for the input
permutations. Although they all complete the comparison work represented in
the semi-digraph, the final relationship $O_t$ may not be fixed. Whether
Theorem 4-2 is true for non-homogeneous execution needs further
investigation. As a conclusion, if there exists a single comparison network to model
the execution of a duplicate-removal algorithm for all the input
permutations then we can derive a compatible network to perform sorting. However,
the reducibility of sorting to duplicate-removal does not imply the existence
of the single comparison network.

Applying  the  proof  technique  for  Theorem  4-2  to  an  exclusive-write, 
exclusive-read  network,  we  have  Corollary  4-1. 

Corollary  4-1.  Given  any  rw-conflict-free  network  for  duplicate-removal, 
there  exists  a  rw-conflict-free  network  which  has  the  same  depth  and  is  able 
to  sort. 

Some well known parallel machines can be modeled by PIM machines
with fixed interconnection patterns between n processing elements and n
memory modules. Examples are the ILLIAC IV computer [Barn68], tree
machines, and the ultracomputer. The execution of duplicate-removal
algorithms on a particular machine can then be modeled by an especially regular
sorting network. If we rearrange the horizontal data lines then the new
sorting network may not preserve the original pattern of connections. That
is, the interconnection pattern is changed.

Corollary 4-2. Given any algorithm for duplicate-removal on an
exclusive-write PIM machine with interconnection pattern $I$, there exists an
interconnection pattern $I_s$ which enables the algorithm to perform sorting.

It is not necessary that the interconnection patterns $I$ and $I_s$ in
Corollary 4-2 be different. On a machine with a tree interconnection or a
d-dimensional mesh interconnection, we can prove that both sorting and
duplicate-removal can be solved using basically the same algorithms. On
the tree machine, implementation of the heap sort requires $O(n)$ time
[Mea81]. When the ordered sequence is removed from the root node, all the
duplicates can be found. On the mesh-connected machine, Batcher's sorting
scheme can be implemented in $O(n^{1/d})$ time, where d is the number of
dimensions [Thom77]. The bitonic POP-SORT, which is based on Batcher's
sorting scheme and a new comparison function, performs duplicate-removal
in the same time.
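
The remark that duplicates can be found while the ordered sequence leaves the root can be illustrated sequentially: once items arrive in sorted order, equal items are adjacent, so one comparison with the previous item is enough. The following is a minimal Python sketch of that idea only. The function name is hypothetical, and it is not the POP-SORT comparison function itself.

```python
def mark_duplicates_sorted(seq):
    """Pair each item of an already sorted sequence with a duplicate flag.

    The first occurrence of a value is kept (flag False); every later
    occurrence is marked off as a duplicate (flag True).
    """
    marked = []
    for pos, value in enumerate(seq):
        # In sorted order a duplicate always sits right after an equal item.
        is_duplicate = pos > 0 and seq[pos - 1] == value
        marked.append((value, is_duplicate))
    return marked


if __name__ == "__main__":
    print(mark_duplicates_sorted([1, 1, 2, 3, 3, 3]))
```

Each item is inspected once, so the marking adds only constant work per item to the time already spent producing the sorted sequence.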

Due to the I/O bottleneck at the root node, linear time is the best
performance obtainable from the tree machine. Based on the argument on the
longest distance that data may need to move on the mesh-connected
computer, $O(n^{1/d})$ is the optimal time [Thom77]. These two examples show that
duplicate-removal  can  be  best  solved  by  optimal  sorting  algorithms  on  the 


CHAPTER  5 


BITONIC  SORT  ON  THE  CHiP  COMPUTERS 


Batcher's bitonic sort has been conjectured to be the best network
sorting method† [Prep78]. It has been intensively studied and several
adapted algorithms for machines of different processor interconnections are
available [Ston71, Thom77, Nass79, Prep81]. The bitonic sort can be done in
$O(\sqrt{n})$ time on mesh-connected computers [Thom77, Nass79]. It requires
only $O(\log^2 n)$ time with the shuffle-exchange interconnection [Ston71] or
the cube-connected-cycles (CCC) [Prep81].

The bitonic POP-SORT, based on Batcher's bitonic merge sort, is an
important example of POP-SORT. It can simplify the processing of whole
queries on the CHiP processors significantly (see Chapter 6). Implementation
of the bitonic POP-SORT directly refers to the implementation of the
bitonic sort. On a CHiP computer, it is feasible to embed all those
interconnections on the switch lattice and perform the adapted algorithms.

C. D. Thompson proved that any layout of the shuffle-exchange graph
requires at least $O(n^2/\log^2 n)$ area [Thom80]. There are layout algorithms
which achieve $O(n^2/\log^{\alpha} n)$ area, $\alpha$ = 1/2, 1, 3/2, or 2 [Thom80, Hoey80,
Klei81]. However, the layouts are complicated and unavoidably require
large areas. The CCC improves the shuffle-exchange layouts on regularity,
but still requires large embedding areas [Prep81]. On the CHiP computers,
embedding the shuffle-exchange layouts tends to require even larger areas,
because the CHiP processors are not simple grids.

† $O(\log^2 n)$ depth and $O(n \log^2 n)$ processing components.

A lacing technique can be used to exploit the cross-over capability of
switches for embedding layouts on the CHiP processors (see Appendix B).
The lacing technique is very useful in embedding complicated
interconnections. Nevertheless, embedding powerful interconnections like
shuffle-exchange and CCC would leave a large portion of processing elements
unused. Furthermore, there are long connection paths in any layout of the
shuffle-exchange graph or the CCC. Long data paths are vulnerable to the
propagation delay problem [Bila81]. In this chapter we therefore emphasize
the bitonic sort with the mesh and mesh-like interconnections.

We shall report some interesting aspects of performing the bitonic
sort on the CHiP computers with the mesh interconnection or its variations.
Embedding the mesh interconnection is straightforward. The performance
of $O(\sqrt{n})$ matches the I/O time required for a square CHiP region anyway.
Taking advantage of the switch corridors and the cross-over capability of
switches, one can further improve the communication power over the simple
mesh interconnection.

In Section 5.1, an efficient Rearrangement Algorithm is presented to
reorder sorted data items from one indexing scheme to another. A
technique, called sorting with shadow regions, is shown in Section 5.2. With this
technique, allocation of exactly n processing elements is sufficient for
sorting n data items with the bitonic sort (n is any integer, not necessarily a
power of 2). In Section 5.3, we discuss methods to improve the data routing
over the mesh interconnection with the switch lattices. In Section 5.4, we
then address the problem of sorting more data items than the number of
processing elements used.


5.1  Reordering  Between  Indexing  Schemes 

For  sorting  with  the  mesh  interconnection,  there  are  three  important 
schemes  of  indexing  the  processing  elements: 

(1) Shuffled row-major indexing, shown in Figure 5-1(a).

(2) Row-major indexing, shown in Figure 5-1(b).

(3) Snake-like row-major indexing, shown in Figure 5-1(c).

Data  items  are  sorted  into  particular  orders  defined  by  the  indexing 
schemes.  The  choice  of  a  particular  indexing  scheme  depends  on  how  the 
sorted  items  are  to  be  used. 

Figure 5-1. (a) Shuffled row-major indexing, (b) Row-major
indexing, (c) Snake-like row-major indexing.
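
Since the figure itself is not reproduced here, the three schemes can be sketched in Python. This is a minimal illustration under stated assumptions: the side of the mesh is a power of 2, and the shuffled row-major index is taken to be the bit-interleaving of the row and column numbers (the usual reading of [Thom77]). All function names are hypothetical.

```python
def row_major(r, c, side):
    """Index of PE (r, c) when rows are numbered left to right."""
    return r * side + c


def snake_row_major(r, c, side):
    """Like row-major, but odd-numbered rows run right to left."""
    return r * side + (c if r % 2 == 0 else side - 1 - c)


def shuffled_row_major(r, c, side):
    """Interleave the bits of r and c (row bits in the higher position)."""
    bits = side.bit_length() - 1          # log2(side); side assumed a power of 2
    index = 0
    for b in range(bits):
        index |= ((c >> b) & 1) << (2 * b)
        index |= ((r >> b) & 1) << (2 * b + 1)
    return index


def layout(scheme, side):
    """Return the side x side table of indices produced by a scheme."""
    return [[scheme(r, c, side) for c in range(side)] for r in range(side)]


if __name__ == "__main__":
    for scheme in (shuffled_row_major, row_major, snake_row_major):
        print(scheme.__name__)
        for row in layout(scheme, 4):
            print("  ", row)
```

Under the shuffled scheme, indices that differ only in a low-order bit are physically adjacent, which is the locality property the text attributes to it.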


The shuffled row-major indexing comes from an optimal adaptation of
the bitonic sort to the mesh-connected computers [Thom77]. With this
indexing scheme, the more often the processing elements are required to
communicate with each other, the closer they are physically located. If the
sorted sequence is the final result, or when the sorted items are to be stored
in secondary memories, the row-major indexing is perhaps preferred. The
snake-like row-major indexing simplifies the embedding of the
linear array connection for any after-sorting processing.

Thompson and Kung [Thom77] designed the optimal adaptation of the
bitonic sort with the shuffled row-major indexing scheme. Their algorithm
takes $(14\sqrt{n})\,t_R + (2\log^2 n)\,t_C$ time†. They also proved that data items can
be rearranged to obey other indexing schemes with a relatively insignificant
extra cost of $(4\sqrt{n})\,t_R$ time, provided that each processing element can
store $\sqrt{n}$ data items. On the other hand, Nassimi and Sahni [Nass79]
proposed different adapted algorithms of the bitonic sort to sort data items
into the row-major order and the snake-like row-major order. Their
algorithms require $(14\sqrt{n})\,t_R + (2\log^2 n)(t_C + t_I)$ time, where $t_I$ is the time to
interchange the contents of two registers.

The three indexing schemes all have their own advantages. It is not
unusual for more than one indexing scheme to be needed. One may then
employ different algorithms for different indexing schemes. For query
embedding on the CHiP computers, the shuffled row-major indexing is
chosen for the bitonic POP-SORT for a simple and efficient realization (see
Chapter 6). To perform join operations using the bitonic POP-SORT, we
proposed a join system using a linear array connection (see Chapter 3). Thus,
rearranging data items into the snake-like row-major order is needed.

We shall present an "easier" Rearrangement Algorithm which
transforms the shuffled row-major order into the row-major order in less
than $(2\sqrt{n})\,t_R$ time. The algorithm requires only two registers at each
processing element. To translate between the row-major order and the
snake-like row-major order, $(\sqrt{n} + 1)\,t_R$ time is sufficient. Since the
rearrangement algorithm is reversible, transformation between any two indexing
schemes can be done in $(3\sqrt{n})\,t_R$ time without the requirement of extra
memory space.

† The lower order terms are truncated.

Suppose there are two square regions, left and right, each of $i^2$
data items. Data items are already sorted in row-major order in both
regions. All the data items in the right region are larger than those in the
left one. Let $r_{1,0}, r_{1,1}, \ldots, r_{1,i-1}$ denote the rows in the left region and
$r_{2,0}, r_{2,1}, \ldots, r_{2,i-1}$ the rows in the right region. Algorithm 5-1 describes a
rearrangement procedure which merges the two regions into a 1:2 rectangular
region with the row-major indexing again. The rearrangement procedure,
composed of simply swapping rows and unshuffling columns, constitutes a basic
step for the Rearrangement Algorithm. In Figure 5-2, an example of
merging two 4x4 regions is shown. The triangular interchange scheme shown in
Figure 5-3 can be used to unshuffle columns concurrently.

Algorithm 5-1: A basic rearrangement step.

1. Swap odd rows in the left region with even rows in the right
region: $r_{1,2j+1} \leftrightarrow r_{2,2j}$, for $j = 0, 1, \ldots, (\frac{i}{2} - 1)$. Time: $(i + 1)\,t_R$.

2. Unshuffle each column. Time: $2(\frac{i}{2} - 1)\,t_R$.

Total time: $(2i - 1)\,t_R$.

Figure 5-2. Rearrangement merge of two 4x4 regions.

Figure 5-3. A triangular interchange scheme
to perform unshuffle.
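
The basic rearrangement step of Algorithm 5-1 can be checked with a short sequential simulation. The Python sketch below models only the data movement (row swap, then column unshuffle), not the routing time or the triangular interchange scheme. Representing regions as lists of rows and the function names are assumptions made for illustration.

```python
def unshuffle(column):
    """Move the items at even positions to the front, odd positions to the back."""
    return column[0::2] + column[1::2]


def rearrangement_merge(left, right):
    """Merge two i x i row-major regions (all of right > all of left)
    into one i x 2i region in row-major order."""
    i = len(left)
    left = [row[:] for row in left]       # work on copies
    right = [row[:] for row in right]

    # Step 1: swap odd rows of the left region with even rows of the right.
    for j in range(i // 2):
        left[2 * j + 1], right[2 * j] = right[2 * j], left[2 * j + 1]

    # View the two regions side by side as one i x 2i rectangle.
    joined = [left[r] + right[r] for r in range(i)]

    # Step 2: unshuffle every column of the rectangle.
    columns = [unshuffle(list(col)) for col in zip(*joined)]
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    left = [[4 * r + c for c in range(4)] for r in range(4)]         # 0..15
    right = [[16 + 4 * r + c for c in range(4)] for r in range(4)]   # 16..31
    for row in rearrangement_merge(left, right):
        print(row)
```

For the two 4x4 regions holding 0..15 and 16..31, the result is the 4x8 rectangle whose rows read 0..7, 8..15, 16..23 and 24..31, which is the row-major order the algorithm is meant to produce.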


For a square region of n data items with the shuffled row-major
indexing, the Rearrangement Algorithm simply applies the rearrangement merge
step $\log\sqrt{n} - 2$ times. The Rearrangement Algorithm starts by merging
2x2 regions, then 4x4 regions, and so on, and at last $\frac{\sqrt{n}}{2} \times \frac{\sqrt{n}}{2}$ regions.
The total rearrangement time of $(2\sqrt{n})\,t_R$ is calculated, in units of $t_R$, as follows:

$$
\begin{aligned}
\left(2\cdot\tfrac{\sqrt{n}}{2} - 1\right) + \cdots + (2\cdot 2 - 1)
  &= (\sqrt{n} + \cdots + 4) - (\log\sqrt{n} - 2) \\
  &< 2\sqrt{n} - \log\sqrt{n}.
\end{aligned}
$$

In the case of query embedding, rearranging data items is a
requirement. Sorting of a large number of data items can also benefit from
the efficiency of the Rearrangement Algorithm. The $s^2$-way merge algorithm
of [Thom77] is faster than the bitonic sort algorithms of [Thom77] and
[Nass79] when n > 512. The $s^2$-way merge algorithm sorts data items into
snake-like row-major order. It requires $(6n + O(n^{2/3}))\,t_R +
(n + O(n^{2/3}))\,t_C$ time. If $t_C < 2\,t_R$ then $(8n)\,t_R$ is sufficient time for sorting.
Sorting with the $s^2$-way merge algorithm and then rearranging data items
into shuffled row-major order takes less than $(11n)\,t_R$ total time. For large
sorting problems, it is therefore faster to perform the $s^2$-way merge sort
first and then some rearrangement algorithm to translate to the right
indexing scheme.


5.2  Sorting  with  Shadow  Regions 

The bitonic sorting algorithms of [Thom77] and [Nass79] assume a
square array of mesh-connected processors. Performing the sorting
algorithms on the CHiP computers thus needs a square region of $2^{2\lceil\log\sqrt{n}\,\rceil}$
processing elements for sorting n data items. Without any extra effort, the
sorting algorithms can also be executed on a 1:2 rectangular region. The
required CHiP region is thus reduced to $2^{\lceil\log n\rceil}$ area. However, there
are still $2^{\lceil\log n\rceil} - n$ more processing elements used than the number of data
items to be sorted, where $0 \le 2^{\lceil\log n\rceil} - n < n - 1$.


On a rigidly mesh-connected computer, allocation of a larger processor
array to sorting a smaller number of data items is inevitable. The sorting
algorithms must also assume a dummy data item ($-\infty$ or $+\infty$) at each
additional processing element. On the CHiP computers, the processor
interconnection is flexible and configurable. We therefore propose to take
advantage of the programmable switch lattice to resolve the superfluous
allocation problem and handle the dummy value requirement.

Suppose the bitonic sorting algorithm of [Thom77] is chosen to sort
data items into the shuffled row-major order. We present a technique, called
sorting with shadow regions, that requires the allocation of exactly n
processing elements. In Figure 5-4, an example of sorting 176 data items with
two shadow regions is shown. The shadow regions can be allocated to
smaller sorting jobs or to solving other smaller problems. Hence the benefit of
sorting with shadow regions is improved utilization of the processing
elements.


1. Each square represents a 4x4 mesh-connected region.

2. 176 + 4x4 + 8x8 = 256.

Figure 5-4. Sorting 176 data items with
4x4 and 8x8 shadow regions.
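
The region sizes behind this example can be reproduced with a few lines of Python. The sketch below evaluates the square size $2^{2\lceil\log\sqrt{n}\,\rceil}$, the 1:2 rectangle size $2^{\lceil\log n\rceil}$, and the number of shadow processing elements for a given n, using integer arithmetic only. The function names and the dictionary layout are hypothetical.

```python
def ceil_log2(n):
    """Smallest integer e with 2**e >= n, for n >= 1."""
    return (n - 1).bit_length()


def region_sizes(n):
    """Processing elements needed to sort n items with the bitonic sort."""
    e = ceil_log2(n)
    # ceil(log2(sqrt(n))) equals ceil(e / 2), so the square has 2**(2*ceil(e/2)) PEs.
    square = 1 << (2 * ((e + 1) // 2))
    rectangle = 1 << e                    # square or 1:2 rectangular region
    return {"square": square, "rectangle": rectangle, "shadow": rectangle - n}


if __name__ == "__main__":
    print(region_sizes(176))
    print(region_sizes(100))
```

For n = 176 the shadow count is 80, which matches the 4x4 plus 8x8 shadow regions of the figure (16 + 64).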

Given n data items, a sorting region of n processing elements is
allocated according to the shuffled row-major indices from 0 to n-1. The region
consisting of the other $2^{\lceil\log n\rceil} - n$ processing elements is the shadow region.
If the data items are to be sorted in ascending order, the dummy value of
$+\infty$ is chosen. The communication between the sorting region and the
shadow region is therefore nothing but sending and receiving the dummy
value. Using the shadow region merely for the trivial communication is
totally wasteful.

The trivial communication can be simulated at those processing
elements on the boundary with the shadow region. When those processing
elements are requested to read a value from the shadow region, they are given
the dummy value; when they are requested to write to the shadow region,
they just ignore the request. An incomplete mesh interconnection that
connects only the processing elements of indices from 0 to n-1 is thus
sufficient. There are no connections between the sorting region and the
shadow region. The shadow region is free for other use.
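
The effect of the simulated dummy value can be sketched sequentially. The Python code below is a one-dimensional model, not the mesh algorithm of [Thom77]: it assumes a bitonic sorting network in which every comparator puts the smaller item at the lower index, so that the $+\infty$ dummies would never leave the indices n and above. A comparator whose upper partner lies in the shadow region then changes nothing and is simply skipped. The function name is hypothetical.

```python
def bitonic_sort_with_shadow(items):
    """Ascending bitonic sort of n items on a virtual network of
    N = 2**ceil(log2 n) positions; positions n..N-1 form the shadow region."""
    a = list(items)
    n = len(a)
    if n < 2:
        return a
    N = 1 << (n - 1).bit_length()

    def compare_exchange(lo, hi):
        if hi >= n:
            # Partner is in the shadow region: reading it would yield +infinity,
            # and writing to it is ignored, so nothing changes.
            return
        if a[lo] > a[hi]:
            a[lo], a[hi] = a[hi], a[lo]

    size = 2
    while size <= N:
        # First merge step: compare each position with its mirror in the block.
        for start in range(0, N, size):
            for k in range(size // 2):
                compare_exchange(start + k, start + size - 1 - k)
        # Remaining merge steps: compare positions a fixed distance apart.
        dist = size // 4
        while dist >= 1:
            for start in range(0, N, 2 * dist):
                for k in range(dist):
                    compare_exchange(start + k, start + k + dist)
            dist //= 2
        size *= 2
    return a


if __name__ == "__main__":
    data = [(i * 37) % 101 for i in range(176)]
    print(bitonic_sort_with_shadow(data) == sorted(data))
```

With 176 items the virtual network has 256 positions, but only the 176 stored items are ever touched, which mirrors the claim that the shadow region needs no connections to the sorting region.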

The technique of sorting with shadow regions can be applied to allocate
CHiP regions for sorting jobs in a more compact fashion. Let $n_1$ and $n_2$ be
the numbers of the data items of two sorting jobs. Together for the two
sorting jobs, a CHiP region of $n = 2^{\lceil\log(n_1 + n_2)\rceil}$ processing elements is
allocated. The region of indices (0 ~ $n_1 - 1$) is dedicated to the first job,
and the region of indices ($n - n_2$ ~ $n - 1$) is dedicated to the second. Both
jobs treat the regions not allocated to themselves as shadow regions. They may
both choose the dummy value of $+\infty$.
Hence the first sequence is sorted in ascending order and the second is
sorted in descending order. Interestingly, the whole data sequence in the
region of n area becomes a bitonic sequence. The benefit of applying the
technique of sorting with shadow regions will be demonstrated further in
Chapter 6.


5.3  Improvements  on  the  Data  Routing 

The bitonic sort with the mesh interconnection requires $O(\log^2 n)$
comparison time and $O(\sqrt{n})$ data routing time. The comparison time is optimal
with respect to the bitonic sorting method. The data routing time, however,
is due to the restricted communication power of the mesh interconnection.

With the mesh interconnection, data communication between two
distant processing elements is achieved by passing the data over from neighbor
to neighbor. To send a data item from a processing element to another one
i locations apart thus requires i routing steps. With the corridor width w and
the cross-over capability c, the CHiP computer may provide up to w*c data
paths crossing the corridors. The w*c data paths can be used by the
processing elements to communicate with each other at a distance.

Consider a row of 2i processing elements and a horizontal corridor
dedicated to the data communication among the processing elements. The
bitonic sort requires that the i data items at the processing elements in the
left half be sent to those in the right half. This needs at least i/(w*c) time
units since the communication bandwidth through the corridor is w*c.
Hence any improvement in the data routing on the CHiP computers over the
mesh interconnection is bounded by the speed-up factor w*c.
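
The bandwidth argument amounts to a one-line calculation, sketched here in Python with a hypothetical function name. It assumes that each of the w*c corridor paths carries one data item per time unit, so a whole number of time units is needed.

```python
def min_half_row_exchange_steps(i, w, c):
    """Lower bound on the time units needed to move i items across a
    corridor of width w whose switches have cross-over capability c."""
    bandwidth = w * c
    return -(-i // bandwidth)             # integer ceiling of i / (w*c)


if __name__ == "__main__":
    for i in (4, 16, 64):
        mesh = min_half_row_exchange_steps(i, 1, 1)
        chip = min_half_row_exchange_steps(i, 2, 2)
        print(i, mesh, chip, mesh / chip)
```

With w = c = 2 the bound is a quarter of the i steps needed when a single path is available, which is the speed-up factor w*c named in the text.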

In addition to performing the passing-over type of communication,
the CHiP computers can improve the communication power over the mesh
interconnection in the following ways:

1. Communication with direct connections. It is feasible to provide
direct connections for all the communication requirements of the
bitonic sort. A possible cost is $O(\log^3 n)$ reconfiguration steps and
$O(\log n)$ switch settings. Moreover, data transmission through paths
of significantly different lengths needs careful synchronization.
However, cautious application of direct connections, e.g. for short
distance communication, can avoid the complication of
reconfiguration and synchronization.

2. Communication with z-location-jumps. Direct connections for short
distance communication also provide short cuts for long distance
communication. Direct connections between processing elements z
locations apart can be used to connect processing elements
i*z locations apart in i steps of z-location-jump.

On different switch lattices, or different values of w and c, we expect
some variations in reaching the optimal improvement on the data routing.
We are most interested in the practical values of w and c, which are $1 \le w \le d$
and $1 \le c \le 4$ (the degree of incident data paths to switches $d \le 8$). An
example of w = c = 2 and d = 8 which achieves the speed-up factor w*c shall be
demonstrated. For other values of w and c in which we are interested, the
improvement on the data routing can be done in a similar way.

With the switch lattice of w = c = 2 and d = 8, we design three
interconnection patterns: $I_{1,2}$, $I_{4h}$ and $I_{4v}$. Figure 5-5 shows three sub-patterns
which are superimposed to form the pattern $I_{1,2}$. The interconnection $I_{1,2}$
provides direct connections required between PEs one or two locations
apart, both horizontally and vertically. In other words, $I_{1,2}$ maps the
necessary connections for the bitonic stages 1~4 onto the CHiP switch lattice.
The interconnection $I_{4h}$ provides direct connections for PEs four horizontal
locations apart, and $I_{4v}$ for PEs four vertical locations apart. They can be
laid out using the lacing technique as in Appendix B. The three
interconnection patterns together map the necessary connections for the bitonic stages
from 1 to 6. For bitonic merge stages 7, 8, and so on, the interconnection
patterns $I_{4h}$ and $I_{4v}$ can be used to speed up the data routing with the
4-location-jumps.


Figure 5-5. The interconnection pattern $I_{1,2}$ composed
of three sub-patterns: (a) $I_1$, (b) $I_{2h}$, and (c) $I_{2v}$.


To  perform  the  bitonic  sort  with  the  three  interconnection  patterns, 
the  comparison  steps  remain  the  same.  The  routing  steps  and  the 
reconfiguration  steps  are  analyzed  as  follows  (w  =  c  =  2). 


l0&n^  2h/s]-i  _  Jv^T 

routing  steps  =  £  £ 

logn 

reconfiguring  steps  =  3  =  0(log2n) 


f.  f.f.  f 
When n is large the asymptotic speed-up factor is w*c. If we employ only
the interconnection pattern $I_{1,2}$ then the reconfiguration steps are reduced
to one but the speed-up factor becomes w*c/2.

5.4  K-fold  Sorting 

External sorting is expensive. The problem of sorting more data items
than the number of processing elements is thus important. Knuth
addressed that problem in [Knut73, pp. 241-242]. He pointed out that a
sorting network of n data items can be generalized to sort k*n data items if the
comparison operation is replaced by a k-way merge operation. This
generalization idea was applied to several sorting algorithms by G. Baudet and D.
Stevenson in [Baud78].

To sort k*n data items on n processing elements, the data items are
initially distributed evenly to each processing element. The data sequence
at each processing element is then sorted locally. Now, the sequence
$Q = Q_1; Q_2; \ldots; Q_n$ is partially ordered, where $Q_i$ is the sorted sequence of k
elements at $PE_i$. For any sorting algorithm using only the
comparison-interchange operation, Baudet and Stevenson proposed that it can be
generalized to sort the partially ordered sequence Q by substituting the
comparison-interchange operation with a merge-splitting operation.
Performing the merge-splitting operation on two sequences $Q_i$ and $Q_j$ is to
merge the two sequences and split the result into halves to produce the new
occurrences of $Q_i$ and $Q_j$.
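
The merge-splitting operation can be sketched in a few lines of Python. This is an illustration of the operation as described above, not Baudet and Stevenson's implementation. The function name is hypothetical, and both inputs are assumed to be sorted sequences of the same length k.

```python
from heapq import merge


def merge_split(q_i, q_j):
    """Merge two sorted k-item sequences and split the result into halves.

    Returns the new (q_i, q_j): q_i receives the k smallest items and
    q_j the k largest, each still sorted.
    """
    k = len(q_i)
    merged = list(merge(q_i, q_j))        # holds all 2k items at once
    return merged[:k], merged[k:]


if __name__ == "__main__":
    print(merge_split([1, 4, 9], [2, 3, 10]))
```

The intermediate list holds 2k items, so an operation written this way needs working space for both sequences at once.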


Assume that m is the local memory size of processing elements on a
CHiP computer, that is, each processing element can hold m data items.
The internal memory capacity is computed as m*n, provided that a CHiP
region of n processing elements is allocated for the sorting. Only when the
data items to be sorted exceed the internal memory capacity should we
resort to external sorting. However, Baudet and Stevenson's generalization
method does not work when $\frac{m}{2} < k \le m$ since the merge-splitting operation
needs at least 2k working space† at each processing element. We therefore
consider two indexing schemes for the bitonic sort to sort as many as m*n
data items on n processing elements. The comparison-interchange
operation does not have to be replaced by a merge-splitting operation; we simply
perform k comparison-interchange steps.

The two indexing schemes are extensions to the shuffled row-major one.
The processing elements are still indexed in the shuffled row-major order.
Since there are k data items at each processing element, we need to index
further those data items at the same processing element. Data items may
be indexed in the following ways:

(1) Aggregation scheme - Index those data items at $PE_i$ as i*k,
i*k + 1, ..., i*k + (k - 1).

(2) Projection scheme - Index those data items at $PE_i$ as i, n + i,
..., (k - 1)*n + i.


† Baudet and Stevenson used 3k working space at each processing element.


Figure 5-6. Indexing 16 data items on 4 processing elements:
(a) Aggregation scheme, and (b) Projection scheme.
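
The two schemes can be written out directly from their definitions. The Python sketch below lists the data-item indices held by each processing element for the 16-item, 4-PE case of the figure. The function names are hypothetical, and the PE number i stands for the PE's own (shuffled row-major) index.

```python
def aggregation_indices(i, k, n):
    """Indices of the k items at PE i: a contiguous run i*k .. i*k + (k-1)."""
    return [i * k + t for t in range(k)]


def projection_indices(i, k, n):
    """Indices of the k items at PE i: i, n+i, ..., (k-1)*n + i."""
    return [t * n + i for t in range(k)]


if __name__ == "__main__":
    k, n = 4, 4
    for i in range(n):
        print(i, aggregation_indices(i, k, n), projection_indices(i, k, n))
```

Under aggregation, neighboring indices share a processing element. Under projection, items whose indices are congruent modulo n share a processing element, so each block of n consecutive indices is spread over all n processing elements.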

Assume that both k and n are powers of 2, $k = 2^p$ and $n = 2^q$. To sort
the k*n data items using the bitonic sort, p+q stages of the bitonic merge
are required. With the aggregation scheme, the first p stages are to sort
locally each sequence $Q_i$ of k elements at $PE_i$. Then, q more stages are
performed to sort the partially ordered sequence $Q = Q_1; Q_2; \ldots; Q_n$. At the
(p+1)-th, ..., and (p+q)-th stages, they all perform a local execution of the
first p bitonic stages (see Figure A-1).

The bitonic sort with the aggregation scheme may be modified in two
ways. Each execution of the first p bitonic merge stages can be replaced by
a faster local sort. Performing k comparison-interchange steps between two
processing elements may be improved by some overlapping of read/write
and comparison instructions. Let $c_0$ be the local sorting time of k data items
at each processing element, and $c_1$ be the time saved by integrating the k
comparison-interchange steps. If the bitonic sort is directly applied without
any modification then $c_0 = \frac{k}{4}(\log^2 k + \log k)$ and $c_1 = 0$.


Define $T(2^i, k)$ to denote the time required to merge the $k \cdot 2^i$ data
items in the processing elements from 0 to $2^i - 1$, and $S_1(2^{2j}, k)$ the time to
sort the $k \cdot 2^{2j}$ items in the processing elements from 0 to $2^{2j} - 1$. Notice
that $T(1,k) = 0$ and $S_1(1,k) = c_0\,t_C$. We analyze the time complexity of the
bitonic sort with the aggregation scheme in the following:

$$
\begin{aligned}
T(1,k) &= 0, \\
T(2^i,k) &= T(2^{i-1},k) + (k \cdot 2^{\lceil i/2 \rceil} - c_1)\,t_R + k\,t_C.
\end{aligned}
\qquad (5.1a)
$$

$$
\begin{aligned}
S_1(1,k) &= c_0\,t_C, \\
S_1(2^{2j},k) &= S_1(2^{2(j-1)},k) + T(2^{2j-1},k) + T(2^{2j},k).
\end{aligned}
\qquad (5.1b)
$$

Solve the recurrences 5.1a for the merge time function,

$$
T(2^i,k) =
\begin{cases}
[\,k(3 \cdot 2^{(i+1)/2} - 4) - i\,c_1\,]\,t_R + i\,k\,t_C, & \text{if } i \text{ is odd} \\
[\,4k(2^{i/2} - 1) - i\,c_1\,]\,t_R + i\,k\,t_C, & \text{if } i \text{ is even.}
\end{cases}
\qquad (5.2)
$$
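
The closed form for the merge time can be checked numerically against its recurrence. The Python sketch below assumes the recurrence reads $T(2^i,k) = T(2^{i-1},k) + (k \cdot 2^{\lceil i/2 \rceil} - c_1)\,t_R + k\,t_C$ with $T(1,k) = 0$, and it tracks the coefficients of $t_R$ and $t_C$ as a pair of integers. The function names are hypothetical.

```python
def merge_time_recurrence(i, k, c1):
    """Coefficients (of t_R, of t_C) of T(2**i, k), unrolled from the recurrence."""
    routing, comparing = 0, 0
    for step in range(1, i + 1):
        routing += k * 2 ** ((step + 1) // 2) - c1    # (step + 1) // 2 == ceil(step / 2)
        comparing += k
    return routing, comparing


def merge_time_closed_form(i, k, c1):
    """Coefficients (of t_R, of t_C) of T(2**i, k) from the closed form."""
    if i % 2 == 1:
        routing = k * (3 * 2 ** ((i + 1) // 2) - 4) - i * c1
    else:
        routing = 4 * k * (2 ** (i // 2) - 1) - i * c1
    return routing, i * k


if __name__ == "__main__":
    for i in range(0, 9):
        print(i, merge_time_recurrence(i, 4, 1), merge_time_closed_form(i, 4, 1))
```

The two functions agree for every i tried, which supports reading the exponent in the recurrence as $\lceil i/2 \rceil$.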


Substitute the above equation into equation 5.1b,

$$
\begin{aligned}
S_1(1,k) &= c_0\,t_C, \\
S_1(2^{2j},k) &= S_1(2^{2(j-1)},k) + [\,7k \cdot 2^j - 8k - (4j-1)\,c_1\,]\,t_R + (4j-1)\,k\,t_C.
\end{aligned}
$$

Solve the recurrences for the sorting time function,

$$
S_1(n,k) = \left[\,14k(\sqrt{n} - 1) - \frac{c_1}{2}\log^2 n - \frac{8k + c_1}{2}\log n\,\right] t_R
 + \left[\,\frac{k}{2}\log^2 n + \frac{c_0 + k}{2}\log n\,\right] t_C.
\qquad (5.3)
$$


With the projection scheme, the first q stages are equivalent to
performing k runs of the bitonic sort of n data items on n processing elements.
From another point of view, the first q stages with the projection scheme
are the same as with the aggregation scheme except with $c_0 = 0$. The next p
stages are to merge the sequence $Q = Q_1; Q_2; \ldots; Q_k$, where $Q_i$ is a sorted
sequence of n data items over the n processing elements. Assume that
both p and q are even numbers ($k = 2^p$ and $n = 2^q$). The required merge
time at the (q+1)-th stage is $\frac{k}{2}\,t_C + T(n,k)$, and $k\,t_C + T(n,k)$ at the
(q+2)-th stage. Let VS be the time for the p merge stages.

$$
\begin{aligned}
VS(n,k) &= \left(1 + 2 + \cdots + \tfrac{p}{2}\right)\left(\tfrac{3}{2}k\,t_C + 2\,T(n,k)\right) \\
 &= (\log^2 k + 2\log k)\left[\,k(\sqrt{n} - 1) - \frac{c_1}{4}\log n\,\right] t_R \\
 &\quad + (\log^2 k + 2\log k)\left[\,\frac{3k}{16} + \frac{k}{4}\log n\,\right] t_C
\end{aligned}
$$

The total time for the bitonic sort with the projection scheme is thus

$$
S_2(n,k) = S_1(n,k)\big|_{c_0 = 0} + VS(n,k). \qquad (5.4)
$$

Comparing equations 5.3 and 5.4, one finds that the comparison
time might be reduced with the projection scheme, but the data routing
time is definitely increased. Data routing time is the dominating factor in
the time complexity of the bitonic sort with the mesh interconnection.
Notice that when $k = 1$, $c_0 = c_1 = VS(n,k) = 0$. In summary,

$$S_1(k \cdot n, 1) = S_2(k \cdot n, 1) = O(\sqrt{k \cdot n})\, t_R + O(\log^2 k + \log^2 n)\, t_c$$

$$S_1(n,k) = O(k\sqrt{n})\, t_R + O(k \log^2 n + c_0 \log n)\, t_c$$

$$S_2(n,k) = O(k \log^2 k\, \sqrt{n})\, t_R + O(k \log^2 n + k \log^2 k\, \log n)\, t_c$$

We conclude that the k-fold bitonic sort with the aggregation scheme outperforms
the projection scheme assuming $c_0 < O(k \log^2 k)$. The saving factor
$c_1$ does not have any significant effect on the time complexities. The
aggregation scheme emphasizes data locality while the projection scheme
emphasizes parallelism. The former attempts to reduce the routing steps
and the latter attempts to reduce the comparison steps. Only when
$c_0 \to O(k^2)$ and k is large may the projection scheme be better than the
aggregation scheme. In that situation, the sequential sorting time $c_0$ cannot
be compensated by improving data locality.


CHAPTER  6 


QUERY  EMBEDDING 


Partitioning  a  large  problem  into  several  small  and  more  tractable  sub¬ 
problems,  or  divide-and-conquer,  is  a  common  approach  in  computing 
theory  and  practice.  Subproblems  are  often  referred  to  as  basic  operations. 
Existing  algorithms  for  the  basic  operations  are  then  applicable  to  solving 
many  large  problems.  When  each  subproblem  is  very  efficiently  solved  by 
highly  parallel  hardware,  one  interesting  question  is:  What  is  the  relative 
overhead  of  data  movement  among  the  basic  operations? 

One  benefit  of  the  CHiP  computer  is  to  imitate  the  performance 
efficiency  of  algorithmically  specialized  processors  on  the  same  devices. 
Owing  to  its  configurable  switch  lattice,  the  CHiP  computer  is  capable  of 
embedding  suitable  interconnections  for  performing  different  algorithms 
efficiently.  To  solve  a  large  and  computationally  intensive  problem,  several 
algorithms  are  usually  involved.  The  configurability  of  the  CHiP  computer 
also  provides  a  potential  for  composing  those  algorithms  without  producing 
any  bottleneck  of  data  movement  [Snyd82]. 

Composing algorithms includes the embedding of interconnections on
the switch lattices for individual algorithms and the embedding for harmonious
interaction among the algorithms. In [Snyd82], an example of solving a
system of linear equations is demonstrated. To solve the problem, one
might need an algorithm for the LU-decomposition of the coefficient matrix
and a linear recurrence solver to perform the backward substitution.
Snyder showed the interconnection embeddings on a switch lattice which
put together Kung and Leiserson's LU-decomposition algorithm and a systolic
method for the backward substitution [Mead80, Ch.8.3].

To evaluate a database query, several database operations are usually
invoked. Many efficient algorithms exist for implementing those database
operations. As with solving a system of linear equations, techniques of composing
algorithms might also be able to handle query evaluation effectively. However,
I/O and data flow in query evaluation are much more complex. It is a
multiphased problem that takes multiple relations as operands (possibly at
different times) and produces a single relation as result. The problem structure
as well as the problem size, moreover, varies for different database
queries.

Query embedding is the idea of embedding suitable interconnections in
order to process whole queries on the CHiP computer. It involves allocating
a CHiP region and providing appropriate interconnections for efficiently
inputting relations, solving the multiphased problem, and outputting the
result. If query embedding is done in such a way as to embed individual
operations separately and then to compose them together, the interconnections
for routing results from one operation to the next operation may be
far from being realistically embeddable. Fortunately, the primitive operation
POP-SORT, which unifies many database operations, offers hope of avoiding
this difficulty. Composing algorithms would simply become putting different
runs of the same algorithm together. The bitonic POP-SORT is especially
suitable for query embedding since it works in a particularly regular
manner. Indeed, query embedding can be simplified significantly if database
operations are implemented by the bitonic POP-SORT.

In  this  chapter  we  shall  explore  techniques  of  query  embedding  for  pro¬ 
cessing  whole  queries  in  a  highly  parallel  fashion  on  the  CHiP  computer.  In 
Section 6.1 we parse algebraic queries into operation trees and discuss a
general  scheme  of  embedding  the  operation  trees.  Taking  advantage  of  the 
unification  provided  by  the  primitive  operation  POP-SORT,  we  then  demon¬ 
strate  how  simple  query  embedding  can  be  done  for  a  restricted  type  of 
operation  trees  in  Section  6.2.  Section  6.3  summarizes  some  general  stra¬ 
tegies  for  improving  query  embedding.  We  also  extend  the  embedding  tech¬ 
niques  to  evaluate  all  the  algebraic  operation  trees  in  Section  6.4. 

6.1  Embedding  of  Operation  Trees 

Query languages for the relational data model are based on two types of
abstract languages: relational algebra and relational calculus [Ullm80, Ch.4].
The two abstract languages are equivalent in expressive power; calculus expressions
can always be translated into algebraic expressions. It is trivial that an
algebraic expression can be parsed into a tree of algebraic operations (Figure
6-1). In the operation trees, internal nodes represent algebraic operations
and external nodes represent input relations. At the root node a final
operation is performed and the result is produced.
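To make the notion of an operation tree concrete, the following Python sketch builds one from an algebraic expression written as nested tuples. The tuple encoding, the class `OpNode`, and the helper names are assumptions made for illustration only; they are not part of any query language discussed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpNode:
    """A node of an operation tree: an operation name or an input relation name."""
    label: str
    children: List["OpNode"] = field(default_factory=list)

    @property
    def is_external(self) -> bool:
        # External nodes (no children) stand for input relations.
        return not self.children


def build_tree(expr) -> OpNode:
    """Turn ('op', operand, ...) tuples and relation-name strings into a tree."""
    if isinstance(expr, str):
        return OpNode(expr)
    op, *operands = expr
    return OpNode(op, [build_tree(x) for x in operands])


def external_nodes(node: OpNode) -> List[str]:
    """List the input relations from left to right."""
    if node.is_external:
        return [node.label]
    return [name for child in node.children for name in external_nodes(child)]


def height(node: OpNode) -> int:
    """Number of operation levels between the root and the deepest input."""
    if node.is_external:
        return 0
    return 1 + max(height(child) for child in node.children)


if __name__ == "__main__":
    query = ("difference", ("union", "R1", "R2"), ("project", "R3"))
    tree = build_tree(query)
    print(tree.label, external_nodes(tree), height(tree))
```

The root of the resulting tree carries the final operation, and the external nodes are exactly the input relations.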

Existing query languages are not necessarily exact implementations
of the abstract ones. They may have certain extensions to the abstract
languages, e.g. transitive closure, fixed point, and looping. In the most general
case, queries are arbitrary functions on relations. Given any query, one
can still represent it as an operation tree, but the operations are no longer
restricted to algebraic ones. Nevertheless the abstract languages serve as a
benchmark for evaluating existing query languages. Efficient evaluation of
algebraic operation trees is thus very important in achieving fast query processing.

Figure 6-1. An operation tree from parsing a query.

Since  queries  are  represented  and  processed  as  operation  trees,  query 
embedding  on  the  CHiP  computer  is  reduced  to  the  embedding  of  operation 
trees.  To  evaluate  whole  queries  by  embedding  operation  trees,  a  wide  spec¬ 
trum  of  parallelism  is  possible.  We  may  have  inter-operation  and  intra¬ 
operation  parallelism  in  evaluating  an  operation  tree.  We  may  also  have 
inter-query  parallelism  if  the  CHiP  computer  is  big  enough  to  host  several 
queries.  Furthermore,  the  I/O  overhead  can  be  minimized  when  whole 
queries  are  evaluated  on  the  CHiP  computer.  Intermediate  results  tend  to 
be  kept  in  the  CHiP  processor,  and  therefore  the  data  swapping  between  the 
CHiP  processor  and  its  external  storage  may  be  eliminated.  The  ideal  case 
occurs when no I/O request is issued besides loading input relations onto the
CHiP processor and outputting the result relation.


Figure  6-2.  A  general  scheme  of  composing  algorithms 
(operations)  for  query  embedding. 


To  embed  and  execute  an  operation  tree,  a  contiguous  CHiP  region  is 
allocated.  Within  the  region,  interconnections  are  to  be  provided  to  perform 
the  whole  operation  tree  efficiently.  A  general  scheme  of  doing  this  is  as  fol¬ 
lows  (Figure  6-2). 

•  First,  allocate  regions  for  embedding  algorithmically  specialized  inter¬ 
connections  to  perform  individual  operations. 


•  Secondly,  tailor  those  regions  as  compactly  as  possible  according  to  the 
I/O  requirements  and  the  communication  requirements  among  the 
operations. 

The CHiP region allocated for the whole operation tree, called the query
region, is thus partitioned into three types of regions: operation regions,
connection regions, and I/O regions. Operation regions are those allocated to
embedding suitable interconnections for running efficient algorithms of the
operations. Connection regions are those allocated to providing data paths
from operation to operation. I/O regions connect some of the operation
regions to the CHiP perimeter, where the CHiP processor is connected to its
external storage devices.

Two obvious optimization objectives for query embedding are to reduce
the query region and to minimize the total time for evaluating the whole
operation tree. To minimize the query region, operation regions should be
kept as small as possible and they should be packed in such a way that the
needed I/O regions and connection regions are also small. To minimize the
total time, we want the query region to be large enough to provide interconnections
for performing efficient algorithms and putting them together. The
two objectives may not be achievable together. As the space-time tradeoff is a
common phenomenon in the computing world, we may also expect a tradeoff
between the two objectives.

The general scheme of embedding operation trees, as shown in Figure
6-2, provides a basic strategy of query embedding. Only when there is no
better way would we resort to the general embedding scheme, since the general
scheme is exposed to the following problems:

•  The size of the result relation after performing an operation depends on
the operation itself and the distribution of data values. The amount of
significant data items shrinks and swells during the query processing.
It is nontrivial to allocate CHiP regions for the later operations.

•  Large I/O regions are sometimes necessary. For example, a wide
bandwidth is needed in order that OP2 can read in fast, and the data
paths may be long if OP2 is buried far away from the CHiP perimeter.

•  Optimal algorithms to pack operation regions are extremely difficult to
find. Heuristics may result in requesting large connection and I/O
regions.

6.2  Buddy  System  Allocation 

Due  to  the  dynamic  character  of  query  processing  in  which  the  amount 
of  significant  data  items  varies,  the  allocation  of  operation  regions  is  subject 
to  dynamic  strategies.  Unlike  memory  allocation,  dynamic  allocation  of 
CHiP  regions  entails  dynamic  control  of  processing  elements  and  dynamic 
provision of interconnections. Moreover, problems such as deadlock prevention
and communication blockade between parent and child regions must be solved.
To avoid these dynamic complications we therefore present area-effective
static allocation policies in this section.

The  bitonic  POP-SORT  which  is  an  efficient  primitive  for  many  database 
operations  also  yields  a  nice  solution  to  query  embedding.  In  this  section  we 
demonstrate  how  well  the  bitonic  POP-SORT  can  simplify  query  embedding 
on  the  CHiP  computer.  A  restricted  type  of  operation  tree  is  considered.  In 
Section 6.4, the embedding techniques are then extended to evaluate all
algebraic  operation  trees. 

Operation trees considered here may contain some or all of the following
algebraic operations: restriction, projection, duplicate-removal, union,
intersection, difference, and join. Cartesian product and quotient are two
useful algebraic operations that are left out. These two operations are
extremely difficult in nature. They are not directly implementable
by the primitive operation POP-SORT. Fortunately, quotients are not
often executed in query processing and Cartesian products can often be
replaced by joins [Wong76]. Hence the restricted type of operation trees
still covers a sizable portion of database queries.

Any  POP-SORT  has  the  following  two  important  features: 

•  It  employs  marking  functions  to  mark  off  all  the  unwanted  data  items. 

•  It  works  well  even  with  some  marked  and  unwanted  data  items  in  the 
input. 

Suppose that there are parent and child operations which are all implemented
by POP-SORT. The child operations can send the whole chunk of
data  possibly  consisting  of  unwanted  and  marked  items  to  the  parent.  The 
parent  operation  can  then  carry  on  without  worrying  about  the  marked-off 
items.  These  operations  thus  can  be  allocated  CHiP  regions  by  some  static 
strategies. 

Based  on  the  two  features  of  POP-SORT,  another  two  valuable  observa¬ 
tions  are: 

Observation  1.  Restrictions  would  just  produce  more  marked  items, 
and  projections  would  reduce  the  tuple  length.  They  can  be  com¬ 
bined  with  other  operations  that  precede  or  follow  them. 

Observation  2.  Remove-duplicates  are  already  combined  with  union, 
intersection,  and  difference  due  to  the  versatility  of  POP-SORT.  Thus 
the  duplicate-removal  before  or  after  these  three  operations  is 
redundant. 

By merging the internal nodes according to the above observations, operation
trees are shrunk to have only external nodes and those internal ones
for operations excluding restriction and projection (and maybe duplicate-removal).
The allocation of operation regions now becomes the allocation for
a smaller number of internal nodes.
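The shrinking of an operation tree according to Observations 1 and 2 can be sketched as a small tree rewrite. The Python code below is a hypothetical illustration: the operation names, the class `Node`, and the `absorbed` list (which records the restrictions and projections merged into a node) are assumptions and do not describe an actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

ABSORBED = {"restrict", "project"}            # Observation 1
SET_OPS = {"union", "intersect", "difference"}  # Observation 2


@dataclass
class Node:
    op: str                                   # operation or relation name
    children: List["Node"] = field(default_factory=list)
    absorbed: List[str] = field(default_factory=list)


def shrink(node: Node) -> Node:
    """Return a new tree with restrictions, projections, and redundant
    duplicate-removals merged into neighbouring nodes."""
    kids = [shrink(child) for child in node.children]
    if node.op in ABSORBED and len(kids) == 1:
        kids[0].absorbed.append(node.op)      # fold into the operand
        return kids[0]
    if node.op == "dedup" and len(kids) == 1 and kids[0].op in SET_OPS:
        return kids[0]                        # dedup after a set operation
    if node.op in SET_OPS:
        merged = []
        for kid in kids:
            if kid.op == "dedup" and len(kid.children) == 1:
                inner = kid.children[0]       # dedup before a set operation
                inner.absorbed.extend(kid.absorbed)
                kid = inner
            merged.append(kid)
        kids = merged
    return Node(node.op, kids, list(node.absorbed))


def count_internal(node: Node) -> int:
    """Count the operation (internal) nodes of a tree."""
    if not node.children:
        return 0
    return 1 + sum(count_internal(child) for child in node.children)


if __name__ == "__main__":
    query = Node("union", [Node("restrict", [Node("R1")]),
                           Node("dedup", [Node("project", [Node("R2")])])])
    small = shrink(query)
    print(count_internal(query), "->", count_internal(small))
```

In the usage example only the union survives as an internal node, so only one operation region has to be allocated.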

The  bitonic  POP-SORT,  in  particular,  works  in  a  very  regular  manner.  As 
for  query  embedding,  mesh  interconnection  is  chosen  for  the  primitive 
operation  in  order  to  keep  the  operation  regions  small.  Data  items  are 
assumed  to  be  sorted  into  shuffled  row-major  order.  For  joins,  data  items 
are  then  rearranged  into  snake-like  row-major  order  with  a  relatively 
insignificant  overhead  (Chapter  5.1).  In  addition  to  the  two  features  men¬ 
tioned  before,  the  bitonic  POP-SORT  has  another  very  important  one: 

•  Assuming  mesh  interconnection  and  shuffled  row-major  indexing,  the 
bitonic  POP-SORT  works  well  on  a  square  region  or  a  1:2  rectangular 
region. 
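The reason behind this feature can be seen from the indexing itself. The sketch below assumes the usual bit-interleaving definition of shuffled row-major indexing (row bit, then column bit, from the most significant position down), which is not spelled out in this section. The function names are hypothetical. Under that assumption, the processing elements with the $2^m$ smallest indices always form a square or a 1:2 rectangle.

```python
def shuffled_row_major(row, col, bits):
    """Interleave the bits of row and col (row bit first) on a 2^bits x 2^bits mesh."""
    index = 0
    for b in reversed(range(bits)):
        index = (index << 1) | ((row >> b) & 1)
        index = (index << 1) | ((col >> b) & 1)
    return index


def snake_row_major(row, col, side):
    """Row-major order in which odd rows run right to left."""
    return row * side + (col if row % 2 == 0 else side - 1 - col)


def block_shape(count, bits):
    """Bounding box (rows, cols) of the elements whose shuffled index is < count."""
    side = 1 << bits
    cells = [(r, c) for r in range(side) for c in range(side)
             if shuffled_row_major(r, c, bits) < count]
    rows = {r for r, _ in cells}
    cols = {c for _, c in cells}
    return len(rows), len(cols)


if __name__ == "__main__":
    for r in range(4):
        print([shuffled_row_major(r, c, 2) for c in range(4)])
    print(block_shape(4, 2), block_shape(8, 2))
```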

It is this feature that makes algorithm composition very simple. More
observations implied by this feature are as follows:

Observation  3.  The  CHiP  region  for  the  parent  operation  can  be  over¬ 
laid  with  its  child  operation  regions.  No  connection  regions  are 
necessary  because  data  items  are  always  in  positions  ready  for  next 
operation. 

Observation  4.  The  parent  operation  may  only  need  to  execute  a 
stage  of  the  bitonic  merge  instead  of  the  whole  sorting  procedure 
since  the  child  operations  would  have  sorted  the  data  items  in  their 
regions. 


As an analogy to the buddy system for dynamic storage allocation
[Knut73, Vol. I, Ch. 2.5], we present a buddy system for static allocation of
operation regions. Each operation node is allocated a CHiP region whose size is a
power of 2. The two child operation regions are buddies. We do not insist
that buddies be equal in size, but buddies must be located together. Merging
two buddies yields a larger region, and the larger region is for the parent
operation.


Algorithm 6-1: Buddy system allocation.

A.1. Compute area for each internal leaf node; $n_i + n_j \to 2^{\lceil \log(n_i + n_j) \rceil}$,
where $n_i$ and $n_j$ are sizes of input relations.

A.2. Compute areas for remaining input relations; $n_i \to 2^{\lceil \log n_i \rceil}$.

A.3. Compute areas for parent nodes from areas of child nodes
(buddies); $2^i + 2^j \to 2^{\max(i,j)+1}$, where $2^i$ and $2^j$ are areas of
buddies.

B.1. Allocate a query region for the root node.

B.2. Allocate one half of the parent region to each of its child
regions.


Figure  6-3.  An  example  of  buddy  system  allocation. 


Algorithm 6-1 for buddy system allocation is composed of two phases. First,
compute the areas of operation regions from the operation tree's bottom
up. Secondly, allocate operation regions from the top down. In Figure 6-3 we
show an example of buddy system allocation.
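The two phases can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: an operation tree is encoded as nested pairs whose leaves are the integer sizes of input relations, an "internal leaf node" is read as a node whose two children are both input relations, and a region is described by an offset and a size in a linear (for example shuffled row-major) index space. All names are hypothetical.

```python
def next_pow2(n):
    """Smallest power of 2 that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def is_internal_leaf(node):
    """True for an operation whose operands are both input relations."""
    return isinstance(node, tuple) and all(isinstance(c, int) for c in node)


def area(node):
    """Phase A: compute areas from the bottom up."""
    if isinstance(node, int):                  # A.2: a remaining input relation
        return next_pow2(node)
    if is_internal_leaf(node):                 # A.1: inputs share one region
        return next_pow2(sum(node))
    left, right = node                         # A.3: buddies -> parent
    return 2 * max(area(left), area(right))


def allocate(node, offset, size, path="root", out=None):
    """Phase B: hand one half of each parent region to each child."""
    if out is None:
        out = []
    out.append((path, offset, size))
    if isinstance(node, tuple) and not is_internal_leaf(node):
        half = size // 2
        allocate(node[0], offset, half, path + ".L", out)
        allocate(node[1], offset + half, half, path + ".R", out)
    return out


if __name__ == "__main__":
    tree = (((16, 1), 1), (1, 1))              # five input relations
    query_area = area(tree)
    for region in allocate(tree, 0, query_area):
        print(region)
```

Because every region is a power of 2 and every child receives exactly one half of its parent, buddies are always adjacent, which is what allows the parent region to be overlaid on its children.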


Figure 6-4. An example of processing a class of
queries using the bitonic POP-SORT. Phases shown: 1. POP-SORT (OP1);
2. post-sorting processing; 3. bitonic merge (OP2, OP3); 4. post-sorting
processing; 5. bitonic merge (OP4); 6. post-sorting processing. Legend:
perform POP-SORT in the region; perform bitonic merge to merge two regions;
idle; perform post-sorting processing in constant time.


Processing the whole example query is partitioned into six phases in
Figure 6-4. Phases 1 and 2 would complete the execution of OP1, phases 3
and 4 would complete OP2 and OP3 concurrently, and so on. In phase 1, the
bitonic POP-SORT is performed in each "circled" region. In phase 2, the
post-sorting processing is executed in the "crossed" region, and the rest of
the region does nothing but wait. In phase 3, one stage of the bitonic merge is
sufficient since the data items in the circled regions are already ordered as
bitonic sequences.

It is surprising that the multiphased query processing can be viewed as
a big sorting job interleaved with other processes. The post-sorting processing
for union, intersection, or difference requires only constant time (see
Chapter 3.1). For queries involving only these operations and restriction,
projection, and duplicate-removal, the multiphased query processing thus
works exactly like a big sorting job, except with some constant time processing.
It guarantees a total processing time of $O(\sqrt{Q_{area}})$, where $Q_{area}$ is
the area of the query region.

Notice that the bitonic POP-SORT is also very helpful for some of the
equi-join or natural join operations. The post-sorting processing for those joins
would involve (1) reordering from shuffled row-major index to snake-like
row-major index, (2) performing the easy-catch process, (3) running the sprinkle
algorithm to resolve the hot spots problem, and (4) restoring the order
of data items by running the bitonic POP-SORT again. For quite a few practical
cases of joins, the post-sorting processing can be done in $O(\sqrt{n})$ time,
where n is the area of the region on which the join operation is performed.
Thus the total processing time is still $O(\sqrt{Q_{area}})$. However, not all join
operations work well this way. We shall address this problem in Section 6.4.

The allocation algorithm 6-1 does not attempt to pack input relations in
a compact fashion. It packs better when buddies are about the same size.
However, there is a worst-case anomaly:

Let $n_1 = 2^k$, and $n_2 = n_3 = n_4 = n_5 = 1$.

==> Total input size is $I_{size} = 2^k + 4$.

Query region has area $Q_{area} = 2^{k+3}$ ($h = 3$).

==> Nearly $\frac{7}{8}$ of the query region is wasted!

In general, the area of a query region allocated by Algorithm 6-1 is within a
range as follows:

$$I_{size} \le Q_{area} \le 2^h \cdot I_{size}.$$
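As a quick check of the anomaly, assume the input sizes above ($n_1 = 2^k$ and four relations of size 1) and a query region of area $2^{k+3}$. The used and wasted fractions of the query region then work out as follows.

$$\frac{I_{size}}{Q_{area}} = \frac{2^k + 4}{2^{k+3}} = \frac{1}{8} + 2^{-(k+1)},
\qquad
1 - \frac{I_{size}}{Q_{area}} = \frac{7}{8} - 2^{-(k+1)}.$$

For example, with $k = 10$ the input occupies $1028/8192 \approx 0.1255$ of the region, so about $0.8745$ of it is idle, and the wasted fraction approaches $\frac{7}{8}$ from below as k grows. The ratio $Q_{area}/I_{size} = 2^{k+3}/(2^k + 4)$ stays below $2^3$, in agreement with the upper bound for $h = 3$.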

To  resolve  the  anomaly,  we  propose  two  approaches.  One  is  to  parse  queries 
into  "good"  trees  which  would  lead  to  more  compact  allocations.  This  is  dis¬ 
cussed  in  Section  6.3  -  query  amelioration.  The  other  approach  is  to  modify 
the  buddy  system  allocation  to  pack  input  relations.  A  modified  version  of 
the  buddy  system  allocation  is  shown  in  the  following  as  Algorithm  6-2.  This 
improved  algorithm  represents  a  packing  technique  based  on  the  feature 
that the bitonic POP-SORT works well with shadow regions (see Chapter 5.2).


Algorithm 6-2: Modified buddy system allocation.

A.1. Transform the binary operation tree into a quaternary tree as
shown in Figure 6-5.

B.1. Compute area for each internal leaf node;
$n_{l_1} + n_{l_2} + n_{r_1} + n_{r_2} \to 2^{\lceil \log(n_{l_1} + n_{l_2} + n_{r_1} + n_{r_2}) \rceil}$,
where $n_{l_1}$, $n_{l_2}$ are sizes of left input relations, and $n_{r_1}$, $n_{r_2}$
are sizes of right input relations.

B.2. Compute areas for remaining input relations; $n_i \to 2^{\lceil \log n_i \rceil}$.

B.3. Let $2^{l_1}$, $2^{l_2}$, $2^{r_1}$, $2^{r_2}$ be areas for child nodes (buddies). Compute
areas for parent nodes from areas of child nodes;
$2^{l_1} + 2^{l_2} + 2^{r_1} + 2^{r_2} \to 2^l$, where $2^l$ is the smallest area that is
large enough for all the four buddies.

C.1. Allocate a query region for the root node.

C.2. Let $0 \ldots 2^l - 1$ be a parent region. Allocate the regions $2^{l_1}$, $2^{l_2}$
subsequently from 0 to $2^l - 1$, if $l_1 \ge l_2$. Allocate the regions $2^{r_1}$, $2^{r_2}$
subsequently from $2^l - 1$ to 0, if $r_1 \ge r_2$.

Figure 6-5. Transforming an operation tree into a
quaternary tree for more compact allocation.


The modified allocation algorithm tries to pack input relations by
grouping four buddies instead of just two. The success of the packing technique
again relies on the relative sizes of buddies. However, the upper
bound of $Q_{area}$ is already improved significantly. In general,

Isiz  *  Qarua  < 

since the height of a quaternary tree is reduced to h/2. Two child operations
might be allocated regions of different areas by the packing attempt.
The synchronization between two child operations therefore cannot count on
the allocation of regions of the same area any more. The operation in a
smaller region would need to wait for the other to finish.
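The area computation of the modified algorithm can be illustrated with a short Python sketch. It is hypothetical in two respects: a quaternary tree is encoded as nested tuples with up to four children and integer leaves for input relation sizes, and the example tree is simply assumed to be the result of the transformation, which is not reproduced here. The function names are likewise assumptions.

```python
def next_pow2(n):
    """Smallest power of 2 that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def packed_area(node):
    """Bottom-up area of a quaternary tree node (steps B.1 to B.3)."""
    if isinstance(node, int):                       # B.2: a single input relation
        return next_pow2(node)
    if all(isinstance(child, int) for child in node):
        return next_pow2(sum(node))                 # B.1: inputs packed together
    # B.3: smallest power of 2 that holds all the buddies side by side.
    return next_pow2(sum(packed_area(child) for child in node))


if __name__ == "__main__":
    # Assumed quaternary form of a tree over inputs of sizes 16, 1, 1, 1, 1.
    quaternary = ((16, 1), 1, 1, 1)
    print(packed_area(quaternary))
```

For these input sizes the sketch yields a query region of 64 processing elements for a total input size of 20. The improvement comes from rounding up once per quaternary level rather than once per binary level.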

Theoretically speaking, sorting with shadow regions can be exploited to
a very complicated extent. It is then possible to compact input relations
further. Nevertheless, the more compactly input relations are packed, the
more complicated the required synchronization tends to be. It might not be
worthwhile to pursue more compact packing since the operation trees considered
are likely to be small, i.e. the value of h is small. Moreover, we may
turn to query amelioration for more compact allocations.

6.3  Query  Amelioration 

The total CHiP region and the total processing time are the two objectives
to be minimized for query embedding on the CHiP computer. Although
"query optimization" is commonly used, query evaluation is not necessarily
optimized over all the possible inputs. The term "query amelioration" would
be more appropriate [Ullm80], especially when there are two interacting
"optimization" objectives. Our query amelioration philosophy is to reduce
the total time while still keeping the query region small. In this section, we
shall summarize some general strategies for query amelioration. These strategies
are from two sources: some come from rephrasing the algebraic expressions,
the others from other implementation considerations.

Queries may take a long time to execute, and the conventional execution
time could be reduced greatly if the queries are rephrased according to
some optimization criteria [Ullm80, Ch.6]. As a rule of thumb, the general
strategies for optimization in [Ullm80, Ch.6] are also valuable on the CHiP
computer. In particular, we summarize four rules for rephrasing algebraic
expressions in order to reduce the CHiP region or the total processing time.

1.  Perform restrictions as early as possible. Restrictions tend to make
significant data items sparse so that more join operations can be performed
by using POP-SORT. (See also Strategy 5.)

2.  Perform projections as early as possible. Projections tend to reduce
the tuple length and therefore reduce the amount of data flow in the CHiP
processor. (See also Strategy 5.)

3.  Cascade restrictions and projections. A sequence of these operations
can be performed all at once.

4.  Combine certain restrictions with their prior Cartesian products into
joins. This helps control the size of intermediate results. Hopefully
some allocation of large CHiP regions can be avoided.

Among the equivalent expressions there are some which usually take
more time than the others. The goal of rephrasing an expression is to avoid
those  more  time-consuming  ones.  The  first  two  strategies  are  feasible  by 
commuting  restriction  with  other  operations,  or  by  commuting  projection 
with  a  Cartesian  product,  join,  union,  or  intersection  (but  not  difference.) 
Strategy  4  is  in  a  sense  a  special  example  of  Strategy  1. 

The  way  in  which  a  particular  expression  is  evaluated  also  affects  the 
query  processing  time.  We  summarize  more  strategies  based  on  the  imple¬ 
mentation  considerations  in  the  following. 

5.  Perform restrictions and projections on the input relations at the mass
storage level. To reduce the input size and the amount of data flow in
the CHiP processor, the restriction and projection on an input relation
are better performed at the mass storage level using the approaches of
conventional database machines.

6.  Combine restrictions and projections with other operations that follow
or precede them. It can be done by loading the restriction predicates
and projection attributes on the processing elements. This simplifies the
allocation of CHiP regions.

7.  Delete redundant duplicate-removal. Multiset operands are allowed for
union, intersection, and difference. Remove-duplicates before these
operations is thus redundant.

8.  Combine a sequence of unions. A single run of POP-SORT on all the
operand relations would complete a sequence of unions.

9.  Parse queries into operation trees in a weight-balanced fashion (or process
small relations first). The buddy system allocation algorithms
work especially well when buddies are about the same size.

10. Load input relations as an ensemble. Input relations are loaded
together onto the CHiP processor according to the allocation pattern. An I/O
time of $O(\sqrt{Q_{area}})$ is thus guaranteed.
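The effect of Strategy 9 on the allocated area can be illustrated with a small experiment. The Python sketch below is not a method from this chapter: it uses a Huffman-style greedy merge (always combine the two lightest subtrees) as one plausible way to "process small relations first", and it recomputes query areas with the bottom-up rules of the buddy system allocation. It applies only to sequences of commutative and associative operations. The tuple encoding and all names are assumptions.

```python
import heapq
from itertools import count


def next_pow2(n):
    """Smallest power of 2 that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def area(node):
    """Bottom-up buddy system area; leaves are input relation sizes."""
    if isinstance(node, int):
        return next_pow2(node)
    if all(isinstance(child, int) for child in node):
        return next_pow2(sum(node))
    left, right = node
    return 2 * max(area(left), area(right))


def weight_balanced_tree(sizes):
    """Greedily pair the two lightest subtrees until one tree remains."""
    tick = count()                      # tie-breaker so subtrees are never compared
    heap = [(size, next(tick), size) for size in sizes]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (a, b)))
    return heap[0][2]


if __name__ == "__main__":
    sizes = [16, 1, 1, 1, 1]
    unbalanced = (((16, 1), 1), (1, 1))
    balanced = weight_balanced_tree(sizes)
    print(area(unbalanced), area(balanced))
```

For these sizes the balanced parse keeps the large relation next to the root, so the small relations no longer force repeated doubling of the large region.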


Figure  6-6.  Weight-balanced  trees. 


Commutative laws and associative laws for unions, intersections, joins,
or Cartesian products are the weapons that we may use to parse queries in a
weight-balanced fashion. Strategy 8 presents an even better amelioration
method for performing a sequence of unions. It is feasible because of the versatility
of POP-SORT to perform union on multisets. For a sequence of intersections
or a sequence of joins (Cartesian products), Strategy 9 can be
applied to parse the operation sequence into a weight-balanced tree. Examples
are shown in Figure 6-6. For a sequence of differences, Strategy 9 is
also useful because of the equivalence law: for any i, $1 \le i < k$,

$$R_1 - R_2 - \cdots - R_k = R_1 - \cdots - R_i - \Big(\bigcup_{j=i+1}^{k} R_j\Big),$$

where $-$ denotes the multiset operation difference with left-to-right
precedence and $\bigcup_{j=i+1}^{k} R_j$ denotes the union of multisets. Assume that
the examples in Figure 6-6 show the weight-balanced parsing of a sequence of
differences. It is interesting to note that the operation nodes on the path
from the external node $n_1$ to the root all perform differences and the rest
all perform unions.
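The equivalence law for differences can be tried out on small multisets. The sketch below uses Python's `collections.Counter` as a stand-in for a multiset relation and rests on two assumptions about semantics that this section does not fix: multiset difference subtracts multiplicities and drops non-positive counts, and multiset union adds multiplicities. The function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def difference_chain(relations):
    """R1 - R2 - ... - Rk, evaluated left to right."""
    return reduce(lambda a, b: a - b, relations)


def regrouped(relations, i):
    """(R1 - ... - Ri) - (R(i+1) union ... union Rk), for 1 <= i < k."""
    head = reduce(lambda a, b: a - b, relations[:i])
    tail = reduce(lambda a, b: a + b, relations[i:])   # additive multiset union
    return head - tail


if __name__ == "__main__":
    rels = [Counter("aaabbc"), Counter("ab"), Counter("a"), Counter("bc")]
    expected = difference_chain(rels)
    for i in range(1, len(rels)):
        assert regrouped(rels, i) == expected
    print(dict(expected))
```

Under these assumptions every choice of i gives the same result, so the split point can be chosen to balance the weights of the two subtrees.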


6.4  Extensions 

Although  the  operation  trees  considered  in  Section  6.2  may  contain  join 
operations,  not  all  the  join  operations  work  well  using  POP-SORT  with  mesh 
interconnection on the CHiP computer. Equi-join and natural join are more
likely  to  work  than  other  join  operations.  However,  even  for  equi-join  or 
natural  join,  performance  may  degrade  due  to  the  hot  spots  problem. 


In this section we shall present a method of performing Cartesian product
that generates the result relation in a square CHiP region. Join
operations can be implemented, at worst, as Cartesian products. Quotient
can also be implemented by Cartesian product, difference, and projection.
Adding Cartesian product to the restricted type of operation trees, we
therefore extend the query embedding techniques presented in Section 6.2 to
evaluate all algebraic operation trees. Similarly, we may extend further to
include operations other than algebraic ones.

Figure 6-7. Systolic method of Cartesian product.

The systolic method of Cartesian product in [Kung80] produces a result
relation in a -- -shaped region (see Figure 6-7). In order to simplify the composition
of Cartesian product with other operations implemented by POP-SORT,
the result relation needs to be in a square region (or a 1:2 rectangular
region). A simple modification of the systolic method can shape the
result relations in square regions. However, the I/O bandwidth of a CHiP
processor is assumed proportional to its perimeter. We shall seek a faster
algorithm which takes advantage of the I/O ports on the perimeter processing
elements.

Figure 6-8. Cartesian product in a square region.

Let $R_1$ and $R_2$ be the two relations, $|R_1| = n_1$, $|R_2| = n_2$, and
$n_1 \le n_2$. The square region to hold the result relation requires a size
of $n_r^2$, where $n_r = \lceil \sqrt{n_1 n_2} \rceil$. Notice that
$n_1 \le n_r \le n_2$. Assume that $n_r = k \cdot n_1$ for simplicity. First,
allocate a query region of area $(n_r + 2) \cdot n_r$. The first two columns,
called the processing columns, are used to produce result tuples. The rest
of the region performs only left-to-right shift and is used to store the result
relation. The following algorithm would generate the Cartesian product of
$R_1$ and $R_2$ in a square region (see Figure 6-8) in $O(n_r)$ time.

Algorithm  6-3:  Cartesian  product. 

1. Load k copies of $R_1$ on the second processing column.

2. If there are no more $R_2$ tuples then stop. Otherwise, load another
column of $R_2$ tuples on the first processing column.

3. Rotate each copy of $R_1$ by $n_1$ steps and produce $n_1$ columns of
result tuples. Go to step 2.
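
The three steps of Algorithm 6-3 can be traced with a small sequential simulation. The sketch below is only an illustration under the simplifying assumptions stated in the text ($n_1 n_2$ is a perfect square and $n_r = k \cdot n_1$); the function name `cartesian_product_square` and the list-of-columns representation of the CHiP region are hypothetical and are not taken from the thesis.

```python
import math


def cartesian_product_square(r1, r2):
    """Simulate Algorithm 6-3; return (result columns, rotation steps used)."""
    n1, n2 = len(r1), len(r2)
    n_r = math.isqrt(n1 * n2)
    # Simplifying assumptions: n1*n2 is a perfect square and n_r = k*n1.
    if n1 == 0 or n_r * n_r != n1 * n2 or n_r % n1 != 0:
        raise ValueError("sketch requires n1*n2 = n_r**2 and n_r = k*n1")
    k = n_r // n1

    # Step 1: k copies of R1 fill the second processing column (n_r rows).
    r1_column = list(r1) * k

    result_columns = []
    steps = 0
    # Step 2: load R2 one column (n_r tuples) at a time until it is exhausted.
    for start in range(0, n2, n_r):
        r2_column = r2[start:start + n_r]
        # Step 3: rotate every copy of R1 through its n1 positions. Each
        # rotation pairs the two processing columns row by row and emits
        # one column of result tuples.
        for shift in range(n1):
            column = []
            for row in range(n_r):
                copy_base = (row // n1) * n1
                t1 = r1_column[copy_base + (row - copy_base + shift) % n1]
                column.append((t1, r2_column[row]))
            result_columns.append(column)
            steps += 1
    return result_columns, steps
```

With $|R_1| = 2$ and $|R_2| = 8$ the simulation fills a 4 by 4 region in four rotation steps, which matches the $O(n_r)$ bound: there are $n_2 / n_r = k$ loads of $R_2$, each followed by $n_1$ rotations, for $k \cdot n_1 = n_r$ steps in total.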

Given any algebraic expression, we may proceed as follows to
evaluate the expression. First, rephrase it according to the query
amelioration Strategies 1-4 summarized in Section 6.3. The rephrased
expression is then parsed into an operation tree possibly containing some
harder-than-sorting operations like Cartesian products, differences, and some
joins. Those operations, not belonging to the restricted type, would partition
the operation tree into single operation nodes and subtrees. Each subtree is
then of the restricted type. The query amelioration strategies in Section 6.3
and the query embedding techniques in Section 6.2 are thus applicable to
each subtree. Single operation nodes can be implemented by the method of
producing a Cartesian product in a square region. For example, join
operations that are not suitable for the easy-catch implementation can be
implemented this way. Significant data items can then be "cornered" to a
smaller square region by performing the bitonic POP-SORT.

To process any algebraic query, the composition of algorithms is no
longer automatically done by the buddy system allocation. Composing
algorithms, in the most general sense, thus becomes a three-level approach.
Ranked in order of preference, the levels are: (1) buddy system allocation,
(2) the general scheme shown in Section 6.1, and (3) off-CHiP processing, the
last resort.


CHAPTER 7

SUMMARY  AND  CONCLUSIONS 


This thesis applies highly parallel database machines to improve
relational database processing. The database machines are dedicated
computers enhanced with highly parallel processing capability. Regularity and
uniformity are mandatory for achieving high performance and
cost-effectiveness. This work first deals with unifying several relational
operations on a regular sorting algorithm.

Given a highly parallel processor, we have shown that any sorting
algorithm designed to execute on the processor can be easily modified to
become a primitive operation POP-SORT. The primitive operation can
efficiently perform sorting, duplicate-removal, union, intersection, and
difference. It can also be used to perform join operations requiring no more
than linear post-sorting processing time.

This  thesis  then  applies  an  instance  of  POP-SORT,  the  bitonic  POP-SORT, 
on  the  CHiP  processors  to  process  whole  queries.  Query  embedding 
presents  a  methodology  which  embeds  appropriate  interconnections  for 
processing  whole  queries  on  the  CHiP  processors.  Due  to  some  interesting 
characteristics  of  the  bitonic  POP-SORT,  query  embedding  is  simplified 
significantly. 


In Section 7.1 we summarize the main contributions of this thesis. To
apply the results of this work successfully to a complete design of a back-end
system, several important issues need to be investigated further. Section
7.2 briefly discusses those issues.

7.1  Main  Contributions 

The  main  contributions  of  this  thesis  are  summarized  in  the  following. 

1.  The  methodology  of  applying  sorting  to  solve  other  database  operations 
is  presented  for  highly  parallel  situations.  Two  techniques  are  shown  to 
adapt,  with  negligible  overhead,  merge-oriented  and  other  sorting 
methods  to  solve  several  database  operations  (POP-SORT). 

2.  The  efficiency  of  POP-SORT  is  studied.  POP-SORT  based  on  an  optimal 
sorting  method  is  proved  to  be  also  optimal  for  performing  duplicate- 
removal,  union,  intersection,  and  difference  for  a  reasonable  class  of 
homogeneous  comparison  computation. 

3. The join system which employs a halting mechanism can terminate join
operations in sublinear time after the argument relations are
preconditioned by POP-SORT.

4. The algorithm Sprinkle can efficiently redistribute data items such that
they are almost evenly distributed over the processing elements.

5. The bitonic POP-SORT, which generalizes Batcher's bitonic sort into
a powerful primitive, defines the (new) upper time bound for
performing each of the five database operations: sorting,
duplicate-removal, union, intersection, and difference.

6. The bitonic sort on the mesh-connected computers in [Thom77, Nass79]
can be improved by a speed-up factor of up to w*c with mesh-like
interconnections on the CHiP computers.

7. Efficient algorithms for reordering data items among three major
indexing schemes and sorting k*n data items on n PEs are proposed and
analyzed.

8. Query embedding exploits all the possible parallelism in processing
whole queries. With the use of the bitonic POP-SORT, query embedding
for a restricted type of queries is simple and straightforward. The
algorithm to produce Cartesian products in square regions further extends
the restricted query embedding to processing all queries.

9. The lacing technique is shown to exploit the maximum number of data
paths provided by the switch corridors on the CHiP computers.

7.2  Future  Research 

In this thesis we have concentrated on exploring parallelism in
processing relational queries with the use of highly parallel processors.
Other issues need to be investigated to complete a reliable design of a
highly parallel database machine.

The I/O bandwidth between the mass storage and the highly parallel
processor must be large to prevent the processor from data starvation.
The mass storage should be content addressable so that searching
and update can be performed efficiently at the storage level. The design of
hardware organizations to implement the requirements of large I/O
bandwidth and content addressability is thus important. More pressing, a
storage model and an I/O model for near-future technologies are necessary
to measure I/O complexities and design problem decomposition algorithms.

Given a highly parallel processor with n PEs, we addressed the problem
of partitioning the processor to perform several small jobs. We also
presented algorithms to allow the total size of argument relations to be k*n
if PEs have local memory space k. However, problems with sizes larger than
k*n must be decomposed into several small ones. Fortunately the
decomposition problem is reduced to an external sorting problem for a family
of queries.

The back-end is dedicated to performing database management functions.
With new hardware technologies and architectures the traditional designs of
database management need to be reconsidered. Programming the highly
parallel processor is another important issue. With the unification proposed
in this research work the programming difficulty should be reduced.


APPENDIX A

BATCHER’S  BITONIC  SORT 


A.1  The  Bitonic  Merge 

Batcher's bitonic sort [Batc68] is based on his bitonic merge algorithm
which sorts bitonic sequences. A sequence $x_0, x_1, \ldots, x_{n-1}$ is said
to be bitonic if either

(1) there is an index i, $0 \le i \le n-1$, such that
$x_0 \le x_1 \le \cdots \le x_i \ge x_{i+1} \ge \cdots \ge x_{n-1}$,

or

(2) the sequence can be shifted cyclically so that condition 1 is satisfied
[Ston71].

The bitonic merge algorithm applies log n parallel comparison steps. Each
step partitions a bitonic sequence into low and high bitonic sequences such
that every item in the low sequence is no larger than any one in the high
sequence. The correctness of the bitonic merge algorithm was proved in
[Batc68] and described as the following theorem in [Ston71].

Batcher's Theorem. Let the sequence $x_0, x_1, \ldots, x_{n-1}$ be bitonic and
$a_i = \min(x_i, x_{i+n/2})$, $b_i = \max(x_i, x_{i+n/2})$ for
$0 \le i < n/2$. The two sequences $a_0, a_1, \ldots, a_{n/2-1}$ and
$b_0, b_1, \ldots, b_{n/2-1}$ are both bitonic, and $a_i \le b_j$ for all i
and j.
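
The theorem can be checked on a small instance. The following sketch assumes the sequence length is a power of two; the helper names `is_bitonic`, `bitonic_split`, and `bitonic_merge` are illustrative choices, and the recursion is a sequential stand-in for the log n parallel comparison steps.

```python
def _rises_then_falls(seq):
    """True if seq is non-decreasing up to some index, then non-increasing."""
    i, n = 0, len(seq)
    while i + 1 < n and seq[i] <= seq[i + 1]:
        i += 1
    while i + 1 < n and seq[i] >= seq[i + 1]:
        i += 1
    return i >= n - 1


def is_bitonic(seq):
    """Bitonic means some cyclic shift rises and then falls."""
    seq = list(seq)
    if len(seq) <= 2:
        return True
    return any(_rises_then_falls(seq[s:] + seq[:s]) for s in range(len(seq)))


def bitonic_split(seq):
    """One comparison step: pair x_i with x_{i+n/2}, keep min and max."""
    half = len(seq) // 2
    low = [min(seq[i], seq[i + half]) for i in range(half)]
    high = [max(seq[i], seq[i + half]) for i in range(half)]
    return low, high


def bitonic_merge(seq):
    """Sort a bitonic sequence by splitting recursively."""
    if len(seq) <= 1:
        return list(seq)
    low, high = bitonic_split(seq)
    return bitonic_merge(low) + bitonic_merge(high)
```

For the bitonic input `[1, 4, 6, 8, 7, 5, 3, 2]`, one split gives the low half `[1, 4, 3, 2]` and the high half `[7, 5, 6, 8]`; both are bitonic and every low item is no larger than every high item, as the theorem states.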


Figure A-1. Sorting network for Batcher's bitonic sort.


Batcher's bitonic sort applies his bitonic merge algorithm log n
times. It can be described as the sorting network shown in Figure A-1
[Knut73, p.237]. There are log n bitonic merge stages and i comparison steps
at the i-th stage. Several adapted versions of the bitonic sort with different
processor interconnections are available.
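
The stage structure can be made concrete with a short sequential simulation that counts parallel comparison steps. This is a sketch of the standard iterative form of the bitonic sorting network, assuming the input length is a power of two; the function name `bitonic_sort` and the returned step counter are illustrative and do not model any particular interconnection.

```python
def bitonic_sort(items):
    """Return (sorted list, number of parallel comparison steps)."""
    data = list(items)
    n = len(data)
    steps = 0
    size = 2  # length of the bitonic sequences merged in the current stage
    while size <= n:
        stride = size // 2
        while stride >= 1:
            # One parallel comparison step: every element meets its partner.
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    ascending = (i & size) == 0
                    if ascending:
                        out_of_order = data[i] > data[partner]
                    else:
                        out_of_order = data[i] < data[partner]
                    if out_of_order:
                        data[i], data[partner] = data[partner], data[i]
            steps += 1
            stride //= 2
        size *= 2
    return data, steps
```

Stage i merges sequences of length $2^i$ and takes i comparison steps, so the total is $\frac{1}{2}(\log^2 n + \log n)$ steps: 6 for n = 8 and 10 for n = 16.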


• The bitonic sort can be done in time
$(\log^2 n)\,t_R + \frac{1}{2}(\log^2 n + \log n)\,t_C$ †
with the perfect shuffle interconnection [Ston71].


• On a mesh-connected computer, the bitonic sort can be done in time
$(14(\sqrt{n}-1) - 4\log n)\,t_R + \frac{1}{2}(\log^2 n + \log n)\,t_C$ if the
sequence is to be sorted into the shuffled row-major order [Thom77].


† The actual time required for one routing step ($t_R$) or one comparison
step ($t_C$) depends on different interconnections or machines.


•  On  mesh-connected  computers,  the  bitonic  sort  can  be  done  in  time 

(14(Vn  -l)-41ogn)f*  +  — (log2*.  +  logn)fc  +  (|-logzn  +  ~~logn)tf  if 

2  O  4 

the  sequence  is  to  be  sorted  into  the  row-major  or  the  snake-like  row- 
major  order  [Nass79]. 

• With the CCC (or the k-Cube) interconnection the bitonic sort can be
done in time $O(\log^2 n)(t_R + t_C)$ [Prep81].

A.2 Effects of Propagation Delay

In practice, sending data from one location to another always
takes time. The speed of data propagation in the chip is many orders of
magnitude slower than the velocity of light [Mead80]. For circuit
performance, the propagation delay in long wires is especially important.
However, it is still controversial how the propagation delay affects the VLSI
computation time. Depending on assumptions, the propagation delay function
p(x) varies widely: $O(\log x) \le p(x) \le O(x^2)$, where x is the wire length
[Pate81].

‡ $t_I$ denotes the time required to interchange the contents of two registers.

Three interesting propagation delay models discussed in [Pate81,
Bila81] are described as follows. It is plausible that R-C circuits define a
quadratic propagation delay function. Since resistance and capacitance
both grow linearly with the wire length, the time constant of the transistor
load is thus a quadratic function of the wire length. However, the
propagation delay is approximately linear, provided repeaters are
added to the long wires. If special drivers are used to speed capacitance
charging then the minimum delay, $p(x) = \log x$, is achievable [Mead80].


According to [Bila81], both current and projected silicon technologies
fall within the realm of the logarithmic propagation delay function. However,
the special drivers cannot be applied without limit to arbitrarily long wires
due to the limitations on the current density. Thus the most realistic
estimate is perhaps $p(x) = \sqrt{x}$ [Pate81].

In analyzing the asymptotic time complexities, one should be
particularly careful about the effect of propagation delay since the maximum
wire length may grow with the problem size. Taking the propagation delay into
account, we recompute here the complexities for the bitonic sort with
different processor interconnections. We consider only the communication
time because the computation time is less susceptible to the propagation
delay. The following table summarizes the results of our
recomputation.

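Table A-1. Effects of propagation delay on the bitonic
sort with different interconnections.

p(x)       $1$                     $c_1 \log x$            $c_2 \sqrt{x}$
shuffle    $\log^2 n$              $c_1 \log^3 n$          $c_2 \sqrt{n}\,\log n$
mesh       $\sqrt{n}$              $\sqrt{n}$              $\sqrt{n}$
CHiP       $\frac{1}{s}\sqrt{n}$   $\frac{a}{s}\sqrt{n}$   $\frac{a}{s}\sqrt{n}$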

The average length of the edges in a planar embedding of the
shuffle-exchange graph is $O(n/\log^2 n)$ [Thom80]. If $p(x) = c_1 \log x$ then
the computation time is $c_1 \log(n/\log^2 n) \cdot \log^2 n$, which is
approximately $c_1 \log^3 n$. If $p(x) = c_2 \sqrt{x}$ then the computation
time is $c_2 \sqrt{n/\log^2 n} \cdot \log^2 n$, which is
$c_2 \sqrt{n}\,\log n$. Other powerful interconnections like the CCC and the
k-Cube should be as susceptible to the propagation delay as the
shuffle-exchange.
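
The two shuffle entries follow from substituting the average edge length $x = n/\log^2 n$ into each delay function and multiplying by the $\log^2 n$ routing steps. The worked form below is a sketch that drops the lower-order $\log\log n$ term in the first line.

```latex
\begin{align*}
c_1 \log\!\left(\frac{n}{\log^2 n}\right)\log^2 n
  &= c_1\,(\log n - 2\log\log n)\,\log^2 n \;\approx\; c_1 \log^3 n, \\
c_2 \sqrt{\frac{n}{\log^2 n}}\;\log^2 n
  &= c_2\,\frac{\sqrt{n}}{\log n}\,\log^2 n \;=\; c_2 \sqrt{n}\,\log n .
\end{align*}
```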

On the CHiP computers, the technique of z-location jumps can be used
to improve the data routing on the mesh-connected computers (Section
5.3). The improvement asymptotically achieves a speed-up factor up to z,
$z \le w \cdot c$, where w is the corridor width and c the cross-over
capability of the switches. Let s denote the speed-up factor, and thus
$s \le z \le w \cdot c$. In the table, we introduce another factor a which
denotes the propagation delay required by the z-location jumps. Due to the
practical consideration of high utilization of components, w*c is bounded by a
small constant, say 32. The propagation delay of the z-location jumps is thus
relatively constant. Assuming that 1-location jumps take unit time, the
factor a should be small, $a \to 1$. Hence we do not distinguish two a's for
the two non-constant propagation delay functions. The CHiP computers may also
embed the shuffle-exchange interconnection, but the effect of propagation
delay will be more severe than that on the shuffle-exchange as shown in the
table.


APPENDIX  B 


SPRINKLE  ALGORITHM 


Suppose we are given n processing elements $PE_0, PE_1, \ldots, PE_{n-1}$ and
a sequence of data items distributed over the processing elements. The data
items are not evenly distributed; there are $x_i$ items at $PE_i$ for all i
in [0, n-1]. The Sprinkle Algorithm is designed to redistribute the data items
so that the sequence is approximately equally distributed over the processing
elements. That is,
$\lfloor \frac{1}{n}\sum_{i=0}^{n-1} x_i \rfloor \le x_i' \le \lceil \frac{1}{n}\sum_{i=0}^{n-1} x_i \rceil$,
where $x_i'$ is the number of data items at $PE_i$ after the redistribution.


Figure B-1. The communication scheme of log n steps
applied in the Sprinkle Algorithm for n = 8.


The redistribution problem is more difficult than the problem of finding
the average of the number sequence $\{x_i\}$, which can be best done in
O(log n) steps. However, a communication scheme of O(log n) steps as shown in
Figure B-1 is still sufficient for the redistribution problem. In Figure B-1,
each arrow denotes an operation that redistributes the data items at the
two PEs. After the operation, the two PEs both have the same number of
data items, or the one pointed to by the arrowhead has one more data item.
In the following example we show how important the directions of the arrows
are.


Example. To redistribute data items among four PEs, the following
figures show (a) the communication scheme leading to the correct
result, and (b) a communication scheme possibly leading to wrong
results.


The Sprinkle Algorithm repeatedly compares the numbers of data items
between pairs of PEs and then ships data items to redistribute them
approximately evenly between the pairs of PEs. It involves log n computation
steps to determine the redistribution strategies. In addition, the data
shipment requires more communication time which depends on the original
distribution and the available processor interconnections. Assume that
$0 \le x_i \le k$, where k is a small constant. The Sprinkle Algorithm needs at
most $(\frac{k}{2}+1) \cdot \log n \; t_R$ communication time if the
appropriate interconnection is provided. With the mesh interconnection, the
communication time required is no more than
$(\frac{k}{2}+1) \cdot 4\sqrt{n} \; t_R$.
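
The pairwise operation denoted by each arrow can be sketched in a few lines. The code below is not the Sprinkle Algorithm itself: the pairing of PE i with PE i XOR stride and the rule that the higher-numbered PE always plays the arrowhead are assumptions made for illustration, since the arrow directions of Figure B-1 are not reproduced here. The names `balance_pair` and `pairwise_balance` are hypothetical.

```python
def balance_pair(tail_count, head_count):
    """Split the combined items evenly; the arrowhead PE gets the odd one."""
    total = tail_count + head_count
    return total // 2, total - total // 2


def pairwise_balance(counts):
    """Apply log n rounds of pairwise balancing; n must be a power of two.

    Returns (final item counts, number of rounds).
    """
    counts = list(counts)
    n = len(counts)
    stride, rounds = 1, 0
    while stride < n:
        for i in range(n):
            j = i ^ stride
            if j > i:
                # Assumed orientation: the arrow points from PE i to PE j.
                counts[i], counts[j] = balance_pair(counts[i], counts[j])
        stride *= 2
        rounds += 1
    return counts, rounds
```

For the initial counts `[5, 0, 3, 0, 0, 7, 1, 0]` this naive orientation ends with `[1, 2, 2, 3, 1, 2, 2, 3]` after three rounds. The total of 16 items is preserved, but the counts are not all within one item of each other, which illustrates why the directions of the arrows have to be chosen with care.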


The  communication  scheme  in  Figure  B-l  is  similar  to  and  is  actually  a 
portion  of  that  needed  in  the  bitonic  sort  (Appendix  A).  Programming  the 
CHiP  computers  to  perform  the  Sprinkle  Algorithm  thus  does  not  demand 
much  additional  effort. 


APPENDIX C


On  the  CHiP  computers,  the  maximum  number  of  data  paths  allowed  to 
cross the corridors is bounded by w*c, with the corridor width w and
cross-over  capability  c  of  the  switches.  If  d  >  2c,  where  d  is  the  degree  of 
incident  data  paths  to  the  switches,  then  the  maximum  bandwidth  w*c  is 
feasible  with  an  embedding  technique  called  lacing.  The  technique  is  to 
embed  straight  data  paths  as  well  as  zig-zag  ones  that  exploit  the  maximum 
bandwidth.  Here  we  show  the  lacing  technique  by  an  example  of  embedding 
the  perfect  shuffle  interconnection  on  a  switch  lattice. 


Figure C-1. The schematic perfect shuffle of n data items
between  two  rows  of  n/2  processors,  n  =  16. 

Figure C-1 shows the n perfect shuffle connections between two rows of
n/2 processing elements for n = 16. Notice that there are n/2 connections
passing through the dotted bisection line. To embed the interconnection on
the CHiP computers, $\frac{n/2}{w \cdot c}$ horizontal corridors are needed.
If n = 32 then two corridors are needed to host the interconnection on the
switch lattice of w = 4, c = 2, and d = 8. Figure C-2 shows the embedding of
the perfect shuffle interconnection for n = 32. Figure C-3 depicts some basic
components which construct the embedding in Figure C-2.

Figure C-3. Some basic components constructing the
embedding in Figure C-2.
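
The count of n/2 connections through the bisection line and the resulting number of corridors can be checked numerically. The sketch below assumes that data item i sits at position i in the upper row, that its shuffle destination sits at the same numbered position in the lower row, and that the bisection line separates positions below n/2 from the rest; the function names are illustrative.

```python
import math


def perfect_shuffle(i, n):
    """Destination of item i under the perfect shuffle of n items."""
    return n - 1 if i == n - 1 else (2 * i) % (n - 1)


def bisection_crossings(n):
    """Connections whose two endpoints lie on opposite sides of the bisection."""
    half = n // 2
    return sum(
        1 for i in range(n) if (i < half) != (perfect_shuffle(i, n) < half)
    )


def corridors_needed(n, w, c):
    """Horizontal corridors required when each corridor carries w*c paths."""
    return math.ceil(bisection_crossings(n) / (w * c))
```

For n = 16 the count is 8, and for n = 32 with w = 4 and c = 2 the 16 crossing connections need two corridors, in agreement with the text.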


The four little pieces shown in Figure C-3(b) exploit the n/2 possible
data paths passing through the bisection line in $\frac{n/2}{w \cdot c}$
horizontal corridors. This lacing technique can be generalized to embed the
perfect shuffle for larger values of n and on other switch lattices. The
structure with two data items per processing element removes the need for
exchange edges as in the shuffle-exchange graph. The embedding in Figure C-2
can be extended to build multistage bitonic sorters as in [Batc68] or unistage
bitonic sorters as in [Ston71] on the CHiP computers.